.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context when interrupts are enabled.
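
For illustration, a minimal sketch of this pattern, with made-up structure and
field names that are not taken from any real driver:

.. code-block:: c

   struct foo_device {
           spinlock_t lock;
           unsigned int pending;
   };

   /* Process context: interrupts may be enabled, so use _irqsave(). */
   static void foo_queue(struct foo_device *dev)
   {
           unsigned long flags;

           spin_lock_irqsave(&dev->lock, flags);
           dev->pending++;
           spin_unlock_irqrestore(&dev->lock, flags);
   }

   /* Hard IRQ context (non-PREEMPT_RT): interrupts are already disabled,
    * so plain spin_lock() is sufficient here.
    */
   static irqreturn_t foo_irq(int irq, void *data)
   {
           struct foo_device *dev = data;

           spin_lock(&dev->lock);
           dev->pending = 0;
           spin_unlock(&dev->lock);
           return IRQ_HANDLED;
   }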

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or run with interrupts
disabled.
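
A minimal sketch of the raw_spinlock_t variant, using illustrative names; on
PREEMPT_RT this still disables preemption and interrupts, so the critical
section must be short and bounded:

.. code-block:: c

   static DEFINE_RAW_SPINLOCK(core_lock);
   static unsigned long core_state;

   static void core_update(unsigned long new_state)
   {
           unsigned long flags;

           /* Non-preemptible, interrupts disabled - even on PREEMPT_RT. */
           raw_spin_lock_irqsave(&core_lock, flags);
           core_state = new_state;
           raw_spin_unlock_irqrestore(&core_lock, flags);
   }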

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.
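
As an illustration, a hedged sketch of a threaded-only registration; the device
and handler names are invented for the example:

.. code-block:: c

   static irqreturn_t foo_irq_thread(int irq, void *data)
   {
           /*
            * Runs in a dedicated interrupt thread; sleeping locks and
            * longer-running work are allowed here.
            */
           return IRQ_HANDLED;
   }

   static int foo_request_irq(struct device *dev, int irq, void *priv)
   {
           /*
            * No primary handler: IRQF_ONESHOT keeps the line masked until
            * foo_irq_thread() has completed.
            */
           return request_threaded_irq(irq, NULL, foo_irq_thread,
                                       IRQF_ONESHOT, dev_name(dev), priv);
   }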


Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.
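
A hedged sketch of this pattern, using invented per-CPU data that is not taken
from any real subsystem; the lock is taken while bottom halves are already
disabled, for example from softirq context:

.. code-block:: c

   struct foo_pcpu {
           local_lock_t    bh_lock;
           u64             packets;
   };

   static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
           .bh_lock = INIT_LOCAL_LOCK(bh_lock),
   };

   static void foo_count_packet(void)
   {
           /* Documents (and on PREEMPT_RT enforces) the BH-level protection. */
           local_lock_nested_bh(&foo_pcpu.bh_lock);
           this_cpu_inc(foo_pcpu.packets);
           local_unlock_nested_bh(&foo_pcpu.bh_lock);
   }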


per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.
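
A minimal sketch of local_lock_t usage with invented names that are not taken
from an actual subsystem:

.. code-block:: c

   struct foo;

   struct foo_cache {
           local_lock_t    lock;
           struct foo      *slot;
   };

   static DEFINE_PER_CPU(struct foo_cache, foo_cache) = {
           .lock = INIT_LOCAL_LOCK(lock),
   };

   static struct foo *foo_cache_get(void)
   {
           struct foo *f;

           /* !PREEMPT_RT: disables preemption; PREEMPT_RT: per-CPU spinlock. */
           local_lock(&foo_cache.lock);
           f = this_cpu_read(foo_cache.slot);
           this_cpu_write(foo_cache.slot, NULL);
           local_unlock(&foo_cache.lock);

           return f;
   }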

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.
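
A short, hedged sketch of preempt_disable_nested() in a context where the
caller is already expected to have disabled preemption on non-PREEMPT_RT
kernels; the function and variable names are illustrative:

.. code-block:: c

   static DEFINE_PER_CPU(u64, foo_events);

   /* Called with a spinlock_t held by the caller. */
   static void foo_account_event(void)
   {
           /*
            * !PREEMPT_RT: lockdep asserts preemption is already off.
            * PREEMPT_RT: actually disables preemption around the update.
            */
           preempt_disable_nested();
           __this_cpu_inc(foo_events);
           preempt_enable_nested();
   }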

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
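
A hedged sketch of requesting hard interrupt expiry, assuming the
hrtimer_setup() interface; the callback name and period are invented:

.. code-block:: c

   static struct hrtimer foo_timer;

   static enum hrtimer_restart foo_timer_fn(struct hrtimer *t)
   {
           /*
            * HRTIMER_MODE_HARD: runs in hard interrupt context even on
            * PREEMPT_RT, so no sleeping locks and only bounded work here.
            */
           hrtimer_forward_now(t, ms_to_ktime(10));
           return HRTIMER_RESTART;
   }

   static void foo_timer_start(void)
   {
           hrtimer_setup(&foo_timer, foo_timer_fn, CLOCK_MONOTONIC,
                         HRTIMER_MODE_REL_HARD);
           hrtimer_start(&foo_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);
   }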

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
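
As an illustration, a hedged sketch of moving both the allocation and the free
out of a non-preemptible region; the names and structure are invented:

.. code-block:: c

   struct foo {
           int data;
   };

   static DEFINE_RAW_SPINLOCK(foo_lock);
   static struct foo *foo_current;

   static int foo_replace(void)
   {
           struct foo *new, *old;

           /* Allocate before entering the non-preemptible section. */
           new = kzalloc(sizeof(*new), GFP_KERNEL);
           if (!new)
                   return -ENOMEM;

           raw_spin_lock(&foo_lock);
           old = foo_current;
           foo_current = new;
           raw_spin_unlock(&foo_lock);

           /* Free after leaving the non-preemptible section. */
           kfree(old);
           return 0;
   }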

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the irq_work/
thread.
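
A hedged sketch of the irq_work API; the callback and item names are invented:

.. code-block:: c

   static void foo_irq_work_fn(struct irq_work *work)
   {
           /*
            * !PREEMPT_RT: runs in hard interrupt context.
            * PREEMPT_RT: runs in the per-CPU irq_work thread, unless the
            * item was initialized with IRQ_WORK_INIT_HARD().
            */
   }

   static struct irq_work foo_work = IRQ_WORK_INIT(foo_irq_work_fn);

   /* Safe to call from NMI or scheduler context. */
   static void foo_poke(void)
   {
           irq_work_queue(&foo_work);
   }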

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.

Some examples are hrtimer_cancel() or timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.

In order to solve the problem at the API level, the sequence locks were
extended to allow a proper handover between the spinning reader and the
possibly blocked writer.

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer’s priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
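
A hedged sketch of a seqcount_spinlock_t protecting invented data; the names
are illustrative only:

.. code-block:: c

   static spinlock_t foo_lock;
   static seqcount_spinlock_t foo_seq;
   static u64 foo_a, foo_b;

   static void foo_init(void)
   {
           spin_lock_init(&foo_lock);
           seqcount_spinlock_init(&foo_seq, &foo_lock);
   }

   static void foo_write(u64 a, u64 b)
   {
           /* The embedded lock serializes writers on all configurations. */
           spin_lock(&foo_lock);
           write_seqcount_begin(&foo_seq);
           foo_a = a;
           foo_b = b;
           write_seqcount_end(&foo_seq);
           spin_unlock(&foo_lock);
   }

   static u64 foo_read_sum(void)
   {
           unsigned int seq;
           u64 sum;

           do {
                   seq = read_seqcount_begin(&foo_seq);
                   sum = foo_a + foo_b;
           } while (read_seqcount_retry(&foo_seq, seq));

           return sum;
   }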

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started and does not need to serialize against it,
then using seqcount_t is reasonable.
243