core-api/real-time/theory.rst

.. SPDX-License-Identifier: GPL-2.0

=====================
Theory of operation
=====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves
this by replacing locking primitives, such as spinlock_t, with a preemptible
and priority-inheritance aware implementation known as rtmutex, and by enforcing
the use of threaded interrupts. As a result, the kernel becomes fully
preemptible, with the exception of a few critical code paths, including entry
code, the scheduler, and low-level interrupt handling routines.

This transformation places the majority of kernel execution contexts under the
control of the scheduler and significantly increasing the number of preemption
points. Consequently, it reduces the latency between a high-priority task
becoming runnable and its actual execution on the CPU.

Scheduling
==========

The core principles of Linux scheduling and the associated user-space API are
documented in the man page sched(7)
`sched(7) <https://man7.org/linux/man-pages/man7/sched.7.html>`_.
By default, the Linux kernel uses the SCHED_OTHER scheduling policy. Under
this policy, a task is preempted when the scheduler determines that it has
consumed a fair share of CPU time relative to other runnable tasks. However,
the policy does not guarantee immediate preemption when a new SCHED_OTHER task
becomes runnable. The currently running task may continue executing.

This behavior differs from that of real-time scheduling policies such as
SCHED_FIFO. When a task with a real-time policy becomes runnable, the
scheduler immediately selects it for execution if it has a higher priority than
the currently running task. The task continues to run until it voluntarily
yields the CPU, typically by blocking on an event.

Sleeping spin locks
===================

The various lock types and their behavior under real-time configurations are
described in detail in Documentation/locking/locktypes.rst.
In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first disabling
preemption and then actively spinning until the lock becomes available. Once
the lock is released, preemption is enabled. From a real-time perspective,
this approach is undesirable because disabling preemption prevents the
scheduler from switching to a higher-priority task, potentially increasing
latency.

To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks
that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented using
rtmutex. Instead of spinning, a task attempting to acquire a contended lock
disables CPU migration, donates its priority to the lock owner (priority
inheritance), and voluntarily schedules out while waiting for the lock to
become available.

Disabling CPU migration provides the same effect as disabling preemption, while
still allowing preemption and ensuring that the task continues to run on the
same CPU while holding a sleeping lock.

Priority inheritance
====================

Lock types such as spinlock_t and mutex_t in a PREEMPT_RT enabled kernel are
implemented on top of rtmutex, which provides support for priority inheritance
(PI). When a task blocks on such a lock, the PI mechanism temporarily
propagates the blocked task’s scheduling parameters to the lock owner.

For example, if a SCHED_FIFO task A blocks on a lock currently held by a
SCHED_OTHER task B, task A’s scheduling policy and priority are temporarily
inherited by task B. After this inheritance, task A is put to sleep while
waiting for the lock, and task B effectively becomes the highest-priority task
in the system. This allows B to continue executing, make progress, and
eventually release the lock.

Once B releases the lock, it reverts to its original scheduling parameters, and
task A can resume execution.

Threaded interrupts
===================

Interrupt handlers are another source of code that executes with preemption
disabled and outside the control of the scheduler. To bring interrupt handling
under scheduler control, PREEMPT_RT enforces threaded interrupt handlers.

With forced threading, interrupt handling is split into two stages. The first
stage, the primary handler, is executed in IRQ context with interrupts disabled.
Its sole responsibility is to wake the associated threaded handler. The second
stage, the threaded handler, is the function passed to request_irq() as the
interrupt handler. It runs in process context, scheduled by the kernel.

From waking the interrupt thread until threaded handling is completed, the
interrupt source is masked in the interrupt controller. This ensures that the
device interrupt remains pending but does not retrigger the CPU, allowing the
system to exit IRQ context and handle the interrupt in a scheduled thread.

By default, the threaded handler executes with the SCHED_FIFO scheduling policy
and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the minimum and
maximum real-time priorities.

If the threaded interrupt handler raises any soft interrupts during its
execution, those soft interrupt routines are invoked after the threaded handler
completes, within the same thread. Preemption remains enabled during the
execution of the soft interrupt handler.

Summary
=======

By using sleeping locks and forced-threaded interrupts, PREEMPT_RT
significantly reduces sections of code where interrupts or preemption is
disabled, allowing the scheduler to preempt the current execution context and
switch to a higher-priority task.