.\" Copyright (c) 2000-2001 John H. Baldwin <jhb@FreeBSD.org>
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 3, 2000
.Dt SCHEDULER 9
.Os
.Sh NAME
.Nm curpriority_cmp ,
.Nm maybe_resched ,
.Nm resetpriority ,
.Nm roundrobin ,
.Nm roundrobin_interval ,
.Nm sched_setup ,
.Nm schedclock ,
.Nm schedcpu ,
.Nm setrunnable ,
.Nm updatepri
.Nd perform round-robin scheduling of runnable processes
.Sh SYNOPSIS
.In sys/param.h
.In sys/proc.h
.Ft int
.Fn curpriority_cmp "struct proc *p"
.Ft void
.Fn maybe_resched "struct thread *td"
.Ft void
.Fn propagate_priority "struct proc *p"
.Ft void
.Fn resetpriority "struct ksegrp *kg"
.Ft void
.Fn roundrobin "void *arg"
.Ft int
.Fn roundrobin_interval "void"
.Ft void
.Fn sched_setup "void *dummy"
.Ft void
.Fn schedclock "struct thread *td"
.Ft void
.Fn schedcpu "void *arg"
.Ft void
.Fn setrunnable "struct thread *td"
.Ft void
.Fn updatepri "struct thread *td"
.Sh DESCRIPTION
Each process has three different priorities stored in
.Vt "struct proc" :
.Va p_usrpri ,
.Va p_nativepri ,
and
.Va p_priority .
.Pp
The
.Va p_usrpri
member is the user priority of the process, calculated from the process's
estimated CPU time and nice level.
.Pp
The
.Va p_nativepri
member is the saved priority used by
.Fn propagate_priority .
When a process obtains a mutex, its priority is saved in
.Va p_nativepri .
While it holds the mutex, the process's priority may be bumped by another
process that blocks on the mutex.
When the process releases the mutex, its priority is restored to the
priority saved in
.Va p_nativepri .
.Pp
The
.Va p_priority
member is the actual priority of the process and is used to determine what
.Xr runqueue 9
it runs on, for example.
.Pp
The
.Fn curpriority_cmp
function compares the cached priority of the currently running process with
process
.Fa p .
If the currently running process has a higher priority, then it will return
a value less than zero.
If the current process has a lower priority, then it will return a value
greater than zero.
If the current process has the same priority as
.Fa p ,
then
.Fn curpriority_cmp
will return zero.
The cached priority of the currently running process is updated when a process
resumes from
.Xr tsleep 9
or returns to userland in
.Fn userret
and is stored in the private variable
.Va curpriority .
.Pp
The
.Fn maybe_resched
function compares the priorities of the current thread and
.Fa td .
If
.Fa td
has a higher priority than the current thread, then a context switch is
needed, and
.Dv KEF_NEEDRESCHED
is set.
.Pp
The
.Fn propagate_priority
function looks at the process that owns the mutex that
.Fa p
is blocked on.
That process's priority is bumped to the priority of
.Fa p
if needed.
If the process is currently running, then the function returns.
If the process is on a
.Xr runqueue 9 ,
then the process is moved to the appropriate
.Xr runqueue 9
for its new priority.
If the process is blocked on a mutex, its position in the list of
processes blocked on the mutex in question is updated to reflect its new
priority.
Then, the function repeats the procedure using the process that owns the
mutex just encountered.
Note that a process's priority is only bumped to the priority of the
original process
.Fa p ,
not to the priority of the previously encountered process.
.Pp
The
.Fn resetpriority
function recomputes the user priority of the ksegrp
.Fa kg
(stored in
.Va kg_user_pri )
and calls
.Fn maybe_resched
to force a reschedule of each thread in the group if needed.
.Pp
The
.Fn roundrobin
function is used as a
.Xr timeout 9
function to force a reschedule every
.Va sched_quantum
ticks.
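.Pp
The walk performed by
.Fn propagate_priority ,
as described above, can be sketched in userland C.
This is only an illustrative model, not the kernel implementation: the
.Vt toy_proc
and
.Vt toy_mutex
structures and the
.Fn toy_propagate_priority
function are hypothetical names, run queue and waiter-list updates are
reduced to comments, the ownership chain is assumed to contain no cycle,
and a numerically lower value is assumed to mean a higher priority.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model; none of these names are kernel definitions. */
enum toy_state { TOY_RUNNING, TOY_ON_RUNQ, TOY_BLOCKED };

struct toy_proc;

struct toy_mutex {
	struct toy_proc *owner;		/* process currently holding the mutex */
};

struct toy_proc {
	int priority;			/* assumed: lower value = higher priority */
	enum toy_state state;
	struct toy_mutex *blocked_on;	/* mutex being waited for, or NULL */
};

/*
 * Walk the chain of mutex owners starting at p and raise each owner to
 * the priority of the original process p, never to the priority of an
 * intermediate owner.  Returns the number of processes that were bumped.
 */
static int
toy_propagate_priority(struct toy_proc *p)
{
	struct toy_mutex *m;
	int bumped = 0;
	int pri;

	assert(p != NULL);
	pri = p->priority;
	m = p->blocked_on;

	while (m != NULL && m->owner != NULL) {
		struct toy_proc *owner = m->owner;

		if (owner->priority > pri) {
			/* Owner is less urgent than p: bump it. */
			owner->priority = pri;
			bumped++;
		}
		if (owner->state == TOY_RUNNING)
			break;		/* a running owner ends the walk */
		/*
		 * A real scheduler would requeue an owner that is on a run
		 * queue, or reorder it among the waiters of the mutex it
		 * is blocked on.  Here we only follow the chain; an owner
		 * that is not blocked has blocked_on == NULL.
		 */
		m = owner->blocked_on;
	}
	return (bumped);
}
```
.Ed
.Pp
In this model, if process A blocks on a mutex owned by B, and B in turn
blocks on a mutex owned by a running process C, both B and C end up with
the priority of A.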
.Pp
The
.Fn roundrobin_interval
function simply returns the number of clock ticks between reschedules
triggered by
.Fn roundrobin .
Thus, all it does is return the current value of
.Va sched_quantum .
.Pp
The
.Fn sched_setup
function is a
.Xr SYSINIT 9
that is called to start the callout-driven scheduler functions.
It just calls the
.Fn roundrobin
and
.Fn schedcpu
functions for the first time.
After the initial call, the two functions will propagate themselves by
registering their callout event again at the completion of the respective
function.
.Pp
The
.Fn schedclock
function is called by
.Fn statclock
to adjust the priority of the currently running thread's ksegrp.
It updates the group's estimated CPU time and then adjusts the priority via
.Fn resetpriority .
.Pp
The
.Fn schedcpu
function updates all process priorities.
For each process, it first updates statistics that track how long the
process has been in various process states.
Second, it updates the estimated CPU time for the process such
that about 90% of the CPU usage is forgotten in 5 * load average seconds.
For example, if the load average is 2.00,
then about 90% of the estimated CPU time for the process should be based
on the amount of CPU time the process has had in the last 10 seconds.
It then recomputes the priority of the process and moves it to the
appropriate
.Xr runqueue 9
if necessary.
Third, it updates the %CPU estimate used by utilities such as
.Xr ps 1
and
.Xr top 1
so that 95% of the CPU usage is forgotten in 60 seconds.
Once all process priorities have been updated,
.Fn schedcpu
calls
.Fn vmmeter
to update various other statistics including the load average.
Finally, it schedules itself to run again in
.Va hz
clock ticks.
.Pp
The
.Fn setrunnable
function is used to change a process's state to be runnable.
The process is placed on a
.Xr runqueue 9
if needed, and the swapper process is woken up and told to swap the process in
if the process is swapped out.
If the process has been asleep for at least one run of
.Fn schedcpu ,
then
.Fn updatepri
is used to adjust the priority of the process.
.Pp
The
.Fn updatepri
function is used to adjust the priority of a process that has been asleep.
It retroactively decays the estimated CPU time of the process for each
.Fn schedcpu
event that the process was asleep.
Finally, it calls
.Fn resetpriority
to adjust the priority of the process.
.Sh SEE ALSO
.Xr mi_switch 9 ,
.Xr runqueue 9 ,
.Xr sleepqueue 9 ,
.Xr tsleep 9
.Sh BUGS
The
.Va curpriority
variable really should be per-CPU.
In addition,
.Fn maybe_resched
should compare the priority of
.Fa td
with that of each CPU, and then send an IPI to the processor with the lowest
priority to trigger a reschedule if needed.
.Pp
Priority propagation is broken and is thus disabled by default.
The
.Va p_nativepri
variable is only updated if a process does not obtain a sleep mutex on the
first try.
Also, if a process obtains more than one sleep mutex in this manner, and
had its priority bumped in between, then
.Va p_nativepri
will be clobbered.
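.Sh EXAMPLES
The decay that
.Fn schedcpu
applies to the estimated CPU time can be illustrated with a small userland
model.
The sketch below assumes the traditional 4.4BSD filter, which multiplies
the estimate by 2 * load / (2 * load + 1) once per second.
That formula is not stated in this page, the nice contribution is omitted,
and
.Fn toy_decay_estcpu
and
.Fn toy_remaining_fraction
are hypothetical names.
.Bd -literal -offset indent
```c
#include <assert.h>

/*
 * Hypothetical model of one schedcpu() pass over a single estimate.
 * Assumption: the decay factor is 2 * load / (2 * load + 1), as in the
 * traditional 4.4BSD scheduler; the nice term is left out.
 */
static double
toy_decay_estcpu(double estcpu, double loadavg)
{
	assert(loadavg >= 0.0);
	return (estcpu * (2.0 * loadavg) / (2.0 * loadavg + 1.0));
}

/*
 * Fraction of an initial estimate that is still remembered after the
 * given number of once-per-second decay passes.
 */
static double
toy_remaining_fraction(double loadavg, int seconds)
{
	double estcpu = 1.0;
	int i;

	for (i = 0; i < seconds; i++)
		estcpu = toy_decay_estcpu(estcpu, loadavg);
	return (estcpu);
}
```
.Ed
.Pp
Under this assumption, a load average of 2.00 gives a per-second factor of
0.8, so about 11% of the estimate remains after 10 seconds, which is
consistent with the figure of about 90% forgotten in 5 * load average
seconds given in
.Sx DESCRIPTION .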