.\" Copyright (c) 2000-2001 John H. Baldwin <jhb@FreeBSD.org>
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 3, 2000
.Dt SCHEDULER 9
.Os
.Sh NAME
.Nm curpriority_cmp ,
.Nm maybe_resched ,
.Nm propagate_priority ,
.Nm resetpriority ,
.Nm roundrobin ,
.Nm roundrobin_interval ,
.Nm sched_setup ,
.Nm schedclock ,
.Nm schedcpu ,
.Nm setrunnable ,
.Nm updatepri
.Nd perform round-robin scheduling of runnable processes
.Sh SYNOPSIS
.In sys/param.h
.In sys/proc.h
.Ft int
.Fn curpriority_cmp "struct proc *p"
.Ft void
.Fn maybe_resched "struct thread *td"
.Ft void
.Fn propagate_priority "struct proc *p"
.Ft void
.Fn resetpriority "struct ksegrp *kg"
.Ft void
.Fn roundrobin "void *arg"
.Ft int
.Fn roundrobin_interval "void"
.Ft void
.Fn sched_setup "void *dummy"
.Ft void
.Fn schedclock "struct thread *td"
.Ft void
.Fn schedcpu "void *arg"
.Ft void
.Fn setrunnable "struct thread *td"
.Ft void
.Fn updatepri "struct thread *td"
.Sh DESCRIPTION
Each process has three different priorities stored in
.Vt "struct proc" :
.Va p_usrpri ,
.Va p_nativepri ,
and
.Va p_priority .
.Pp
The
.Va p_usrpri
member is the user priority of the process calculated from a process's
estimated CPU time and nice level.
.Pp
The
.Va p_nativepri
member is the saved priority used by
.Fn propagate_priority .
When a process obtains a mutex, its priority is saved in
.Va p_nativepri .
While it holds the mutex, the process's priority may be bumped by another
process that blocks on the mutex.
When the process releases the mutex, its priority is restored to the
priority saved in
.Va p_nativepri .
.Pp
The
.Va p_priority
member is the actual priority of the process and is used, for example,
to determine which
.Xr runqueue 9
it is placed on.
.Pp
The
.Fn curpriority_cmp
function compares the cached priority of the currently running process with
process
.Fa p .
If the currently running process has a higher priority, then it will return
a value less than zero.
If the current process has a lower priority, then it will return a value
greater than zero.
If the current process has the same priority as
.Fa p ,
then
.Fn curpriority_cmp
will return zero.
The cached priority of the currently running process is updated when a process
resumes from
.Xr tsleep 9
or returns to userland in
.Fn userret
and is stored in the private variable
.Va curpriority .
.Pp
The
.Fn maybe_resched
function compares the priorities of the current thread and
.Fa td .
If
.Fa td
has a higher priority than the current thread, then a context switch is
needed, and
.Dv KEF_NEEDRESCHED
is set.
.Pp
The
.Fn propagate_priority
function looks at the process that owns the mutex that
.Fa p
is blocked on.
That process's priority is bumped to the priority of
.Fa p
if needed.
If the process is currently running, then the function returns.
If the process is on a
.Xr runqueue 9 ,
then the process is moved to the appropriate
.Xr runqueue 9
for its new priority.
If the process is blocked on a mutex, its position in the list of
processes blocked on the mutex in question is updated to reflect its new
priority.
Then, the function repeats the procedure using the process that owns the
mutex just encountered.
Note that a process's priorities are only bumped to the priority of the
original process
.Fa p ,
not to the priority of the previously encountered process.
.Pp
The
.Fn resetpriority
function recomputes the user priority of the ksegrp
.Fa kg
(stored in
.Va kg_user_pri )
and calls
.Fn maybe_resched
to force a reschedule of each thread in the group if needed.
.Pp
The
.Fn roundrobin
function is used as a
.Xr timeout 9
function to force a reschedule every
.Va sched_quantum
ticks.
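.Pp
The chain walk described for
.Fn propagate_priority
above can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not the kernel implementation:
the structures, the
.Va requeued
flag, and the name
.Fn propagate_priority_sketch
are hypothetical, and the model assumes that a numerically lower value means
a higher priority and that the walk stops once an owner already runs at the
lent priority or better.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum pstate { P_RUNNING, P_ON_RUNQ, P_BLOCKED };

struct mtx;

struct proc {
	int		 priority;	/* assumed: lower value = higher priority */
	enum pstate	 state;
	struct mtx	*blocked_on;	/* mutex this process waits for, or NULL */
	int		 requeued;	/* set when the model "moves" the process */
};

struct mtx {
	struct proc	*owner;		/* current holder, or NULL */
};

/*
 * Lend the priority of p to each owner along the chain of mutexes that
 * starts at the mutex p is blocked on.  Returns the number of processes
 * whose priority was raised.
 */
static int
propagate_priority_sketch(struct proc *p)
{
	struct mtx *m;
	int bumped = 0;
	int pri;

	assert(p != NULL);
	pri = p->priority;		/* always the ORIGINAL process's priority */
	m = p->blocked_on;

	while (m != NULL && m->owner != NULL) {
		struct proc *owner = m->owner;

		/* Assumption: nothing to do if the owner is already as urgent. */
		if (owner->priority <= pri)
			return (bumped);
		owner->priority = pri;
		bumped++;

		if (owner->state == P_RUNNING)
			return (bumped);	/* already on a CPU: done */
		if (owner->state == P_ON_RUNQ) {
			owner->requeued = 1;	/* stands in for a run queue move */
			return (bumped);
		}
		/*
		 * Owner is itself blocked: the real code would also reorder
		 * it in the mutex's wait list (not modelled here).  Continue
		 * with the owner of the mutex it waits for.
		 */
		m = owner->blocked_on;
	}
	return (bumped);
}
```

In this model a process of priority 5 blocked behind owners of priority 20
and 30 leaves both owners at priority 5, which mirrors the note that only the
priority of the original process
.Fa p
is lent along the chain.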
.Pp
The
.Fn roundrobin_interval
function simply returns the number of clock ticks in between reschedules
triggered by
.Fn roundrobin .
Thus, all it does is return the current value of
.Va sched_quantum .
.Pp
The
.Fn sched_setup
function is a
.Xr SYSINIT 9
that is called to start the callout driven scheduler functions.
It just calls the
.Fn roundrobin
and
.Fn schedcpu
functions for the first time.
After the initial call, the two functions will propagate themselves by
registering their callout event again at the completion of the respective
function.
.Pp
The
.Fn schedclock
function is called by
.Fn statclock
to adjust the priority of the currently running thread's ksegrp.
It updates the group's estimated CPU time and then adjusts the priority via
.Fn resetpriority .
.Pp
The
.Fn schedcpu
function updates all process priorities.
First, it updates statistics that track how long processes have been in various
process states.
Secondly, it updates the estimated CPU time for the current process such
that about 90% of the CPU usage is forgotten in 5 * load average seconds.
For example, if the load average is 2.00,
then at least 90% of the estimated CPU time for the process should be based
on the amount of CPU time the process has had in the last 10 seconds.
It then recomputes the priority of the process and moves it to the
appropriate
.Xr runqueue 9
if necessary.
Thirdly, it updates the %CPU estimate used by utilities such as
.Xr ps 1
and
.Xr top 1
so that 95% of the CPU usage is forgotten in 60 seconds.
Once all process priorities have been updated,
.Fn schedcpu
calls
.Fn vmmeter
to update various other statistics including the load average.
Finally, it schedules itself to run again in
.Va hz
clock ticks.
.Pp
The
.Fn setrunnable
function is used to change a process's state to be runnable.
The process is placed on a
.Xr runqueue 9
if needed, and the swapper process is woken up and told to swap the process in
if the process is swapped out.
If the process has been asleep for at least one run of
.Fn schedcpu ,
then
.Fn updatepri
is used to adjust the priority of the process.
.Pp
The
.Fn updatepri
function is used to adjust the priority of a process that has been asleep.
It retroactively decays the estimated CPU time of the process for each
.Fn schedcpu
event that the process was asleep.
Finally, it calls
.Fn resetpriority
to adjust the priority of the process.
.Sh SEE ALSO
.Xr mi_switch 9 ,
.Xr runqueue 9 ,
.Xr sleepqueue 9 ,
.Xr tsleep 9
.Sh BUGS
The
.Va curpriority
variable really should be per-CPU.
In addition,
.Fn maybe_resched
should compare the priority of
.Fa td
with that of each CPU, and then send an IPI to the processor with the lowest
priority to trigger a reschedule if needed.
.Pp
Priority propagation is broken and is thus disabled by default.
The
.Va p_nativepri
variable is only updated if a process does not obtain a sleep mutex on the
first try.
Also, if a process obtains more than one sleep mutex in this manner, and
had its priority bumped in between, then
.Va p_nativepri
will be clobbered.
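.Sh EXAMPLES
The estimated CPU time decay described for
.Fn schedcpu
and
.Fn updatepri
can be checked with a few lines of arithmetic.
The sketch below assumes the classic 4.4BSD decay rule, in which the estimate
is multiplied by 2 * load / (2 * load + 1) once per second; that formula is
an assumption and is not stated in this page, and the nice adjustment is left
out.
The function names
.Fn decay_estcpu
and
.Fn fraction_remaining
are hypothetical.

```c
#include <assert.h>

/*
 * One decay step under the ASSUMED rule
 *     estcpu = estcpu * (2 * load) / (2 * load + 1)
 * applied once per schedcpu run (once per second).
 */
static double
decay_estcpu(double estcpu, double loadavg)
{
	assert(loadavg >= 0.0);
	return (estcpu * (2.0 * loadavg) / (2.0 * loadavg + 1.0));
}

/*
 * Fraction of the original estimate that is left after `seconds`
 * consecutive decay steps at a constant load average.
 */
static double
fraction_remaining(double loadavg, int seconds)
{
	double est = 1.0;
	int i;

	for (i = 0; i < seconds; i++)
		est = decay_estcpu(est, loadavg);
	return (est);
}
```

Under this assumption, a load average of 2.00 gives a factor of 0.8 per
second, so ten steps leave roughly 11% of the original estimate, which is
consistent with the statement that about 90% of the usage is forgotten in
5 * load average seconds.
A process that slept through several
.Fn schedcpu
runs would have the same step applied once per missed run.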