.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context when interrupts are enabled.
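
A minimal sketch of this classic pattern on a non-PREEMPT_RT kernel; the lock
and list names are hypothetical:

.. code-block:: c

   #include <linux/list.h>
   #include <linux/spinlock.h>

   static DEFINE_SPINLOCK(queue_lock);   /* hypothetical example lock */
   static LIST_HEAD(queue);              /* data shared with the IRQ handler */

   /* Process context: on non-PREEMPT_RT, interrupts stay disabled here. */
   static void queue_add(struct list_head *item)
   {
           unsigned long flags;

           spin_lock_irqsave(&queue_lock, flags);
           list_add_tail(item, &queue);
           spin_unlock_irqrestore(&queue_lock, flags);
   }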

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or run with interrupts
disabled.
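
A short sketch of the raw variant, with a hypothetical lock name. The critical
section stays non-preemptible even on PREEMPT_RT, so it must be strictly
bounded:

.. code-block:: c

   #include <linux/spinlock.h>

   static DEFINE_RAW_SPINLOCK(core_lock);   /* hypothetical low-level lock */

   static void core_update(void)
   {
           unsigned long flags;

           /* Non-preemptible, interrupts off, even on PREEMPT_RT. */
           raw_spin_lock_irqsave(&core_lock, flags);
           /* ... strictly bounded critical section ... */
           raw_spin_unlock_irqrestore(&core_lock, flags);
   }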

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.
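
A sketch of the threaded-only registration described above; the handler and
device names are hypothetical:

.. code-block:: c

   #include <linux/interrupt.h>

   /* Threaded handler: runs in process context and may sleep. */
   static irqreturn_t demo_irq_thread(int irq, void *dev_id)
   {
           /* ... talk to the device, sleeping locks are fine here ... */
           return IRQ_HANDLED;
   }

   static int demo_setup_irq(unsigned int irq, void *dev_id)
   {
           /*
            * No primary handler: the core keeps the line masked until
            * demo_irq_thread() returns, hence IRQF_ONESHOT is required.
            */
           return request_threaded_irq(irq, NULL, demo_irq_thread,
                                       IRQF_ONESHOT, "demo", dev_id);
   }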


Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.
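
A minimal sketch, assuming a hypothetical per-CPU structure accessed from
bottom-half context:

.. code-block:: c

   #include <linux/local_lock.h>
   #include <linux/percpu.h>

   struct demo_pcpu {
           local_lock_t    bh_lock;
           unsigned long   counter;
   };

   static DEFINE_PER_CPU(struct demo_pcpu, demo_pcpu) = {
           .bh_lock = INIT_LOCAL_LOCK(bh_lock),
   };

   /* Called with bottom halves already disabled. */
   static void demo_count(void)
   {
           local_lock_nested_bh(&demo_pcpu.bh_lock);
           this_cpu_inc(demo_pcpu.counter);
           local_unlock_nested_bh(&demo_pcpu.bh_lock);
   }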


per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.
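
A sketch of the usual pattern, with hypothetical names:

.. code-block:: c

   #include <linux/local_lock.h>
   #include <linux/percpu.h>
   #include <linux/types.h>

   struct demo_stats {
           local_lock_t    lock;
           u64             packets;
   };

   static DEFINE_PER_CPU(struct demo_stats, demo_stats) = {
           .lock = INIT_LOCAL_LOCK(lock),
   };

   static void demo_account(void)
   {
           /*
            * Disables preemption on non-PREEMPT_RT; acquires the per-CPU
            * spinlock_t on PREEMPT_RT.
            */
           local_lock(&demo_stats.lock);
           this_cpu_add(demo_stats.packets, 1);
           local_unlock(&demo_stats.lock);
   }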

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and if local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.
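
A hedged sketch with a hypothetical counter structure:

.. code-block:: c

   #include <linux/preempt.h>

   struct demo_counter {
           unsigned long events;
   };

   /*
    * The caller already runs with preemption disabled on non-PREEMPT_RT
    * kernels, e.g. under a raw_spinlock_t.
    */
   static void demo_update(struct demo_counter *c)
   {
           /*
            * Asserts disabled preemption on non-PREEMPT_RT; explicitly
            * disables it on PREEMPT_RT.
            */
           preempt_disable_nested();
           c->events++;
           preempt_enable_nested();
   }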

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
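
A sketch of a hard-context timer, assuming a recent kernel that provides
hrtimer_setup() (older kernels use hrtimer_init() and assign the function
pointer directly); the timer and callback names are hypothetical:

.. code-block:: c

   #include <linux/hrtimer.h>
   #include <linux/ktime.h>

   static struct hrtimer demo_timer;

   static enum hrtimer_restart demo_timer_fn(struct hrtimer *t)
   {
           /* Runs in hard interrupt context, must not sleep. */
           return HRTIMER_NORESTART;
   }

   static void demo_timer_start(void)
   {
           /* The _HARD mode keeps expiry in hard interrupt context on PREEMPT_RT. */
           hrtimer_setup(&demo_timer, demo_timer_fn, CLOCK_MONOTONIC,
                         HRTIMER_MODE_REL_HARD);
           hrtimer_start(&demo_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);
   }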

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.
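
A minimal sketch of pre-allocating before entering a non-preemptible section,
with hypothetical names:

.. code-block:: c

   #include <linux/errno.h>
   #include <linux/list.h>
   #include <linux/slab.h>
   #include <linux/spinlock.h>

   static DEFINE_RAW_SPINLOCK(demo_lock);
   static LIST_HEAD(demo_list);

   struct demo_item {
           struct list_head node;
   };

   static int demo_add_item(void)
   {
           struct demo_item *item;
           unsigned long flags;

           /* Allocate first: kmalloc() may sleep on PREEMPT_RT. */
           item = kmalloc(sizeof(*item), GFP_KERNEL);
           if (!item)
                   return -ENOMEM;

           raw_spin_lock_irqsave(&demo_lock, flags);
           list_add_tail(&item->node, &demo_list);
           raw_spin_unlock_irqrestore(&demo_lock, flags);
           return 0;
   }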

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work/ thread.
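
A hedged sketch, with hypothetical callback and work-item names:

.. code-block:: c

   #include <linux/irq_work.h>

   static void demo_work_fn(struct irq_work *work)
   {
           /* On PREEMPT_RT this runs in the per-CPU irq_work thread. */
   }

   static struct irq_work demo_work = IRQ_WORK_INIT(demo_work_fn);

   /* Must run in hard interrupt context even on PREEMPT_RT. */
   static struct irq_work demo_hard_work = IRQ_WORK_INIT_HARD(demo_work_fn);

   /* Safe to call even from NMI context. */
   static void demo_poke(void)
   {
           irq_work_queue(&demo_work);
   }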

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.

Some examples are hrtimer_cancel() or timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.

To solve the problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer’s priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
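
A minimal sketch of the composite type, with hypothetical data and function
names:

.. code-block:: c

   #include <linux/seqlock.h>
   #include <linux/spinlock.h>
   #include <linux/types.h>

   static DEFINE_SPINLOCK(demo_lock);
   static seqcount_spinlock_t demo_seq =
           SEQCNT_SPINLOCK_ZERO(demo_seq, &demo_lock);
   static u64 demo_a, demo_b;

   static void demo_write(u64 a, u64 b)
   {
           /* Serializes writers; readers can boost the owner on PREEMPT_RT. */
           spin_lock(&demo_lock);
           write_seqcount_begin(&demo_seq);
           demo_a = a;
           demo_b = b;
           write_seqcount_end(&demo_seq);
           spin_unlock(&demo_lock);
   }

   static u64 demo_read(void)
   {
           unsigned int seq;
           u64 sum;

           do {
                   seq = read_seqcount_begin(&demo_seq);
                   sum = demo_a + demo_b;
           } while (read_seqcount_retry(&demo_seq, seq));

           return sum;
   }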

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started rather than serialize against it, then using
seqcount_t is reasonable.