.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

.. _napi:

====
NAPI
====

NAPI is the event handling mechanism used by the Linux networking stack.
The name NAPI no longer stands for anything in particular [#]_.

In basic operation the device notifies the host about new events
via an interrupt.
The host then schedules a NAPI instance to process the events.
The device may also be polled for events via NAPI without receiving
interrupts first (:ref:`busy polling<poll>`).

NAPI processing usually happens in the software interrupt context,
but there is an option to use :ref:`separate kernel threads<threaded>`
for NAPI processing.

All in all NAPI abstracts away from the drivers the context and configuration
of event (packet Rx and Tx) processing.

Driver API
==========

The two most important elements of NAPI are the struct napi_struct
and the associated poll method. struct napi_struct holds the state
of the NAPI instance while the method is the driver-specific event
handler. The method will typically free Tx packets that have been
transmitted and process newly received packets.

.. _drv_ctrl:

Control API
-----------

netif_napi_add() and netif_napi_del() add/remove a NAPI instance
from the system. The instances are attached to the netdevice passed
as an argument (and will be deleted automatically when the netdevice is
unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state.
A disabled NAPI can't be scheduled and its poll method is guaranteed
to not be invoked. napi_disable() waits for ownership of the NAPI
instance to be released.

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs but an incorrect sequence of control API
calls may result in crashes, deadlocks, or race conditions. For example,
calling napi_disable() multiple times in a row will hang waiting for
ownership of the NAPI instance to be released.

Drivers using the netdev instance lock may need to use the ``_locked()``
variants of the control APIs when that lock is already held.
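
The sequencing rules above can be illustrated with a small userspace model.
This is only a sketch and not kernel code: the ``toy_`` names are hypothetical,
and the model assumes that the disabled state can be represented as ownership
of the instance being held.

.. code-block:: c

  #include <stdbool.h>

  /* Hypothetical userspace model of the control API rules; not kernel code. */
  struct toy_napi {
      /* Ownership is held; in this model only the disabled state holds it. */
      bool owned;
  };

  /* Instances are added in a disabled state (modelled as "owned"). */
  static void toy_napi_add(struct toy_napi *n)
  {
      n->owned = true;
  }

  /* Enabling releases the ownership held by the disabled state. */
  static void toy_napi_enable(struct toy_napi *n)
  {
      n->owned = false;
  }

  /*
   * The real napi_disable() waits for ownership to be released.
   * The model returns false where that wait could not make progress.
   */
  static bool toy_napi_try_disable(struct toy_napi *n)
  {
      if (n->owned)
          return false;
      n->owned = true;
      return true;
  }

In this model a second ``toy_napi_try_disable()`` without an intervening
``toy_napi_enable()`` reports failure, which corresponds to the hang described
above.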

Datapath API
------------

napi_schedule() is the basic method of scheduling a NAPI poll.
Drivers should call this function in their interrupt handler
(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
will take ownership of the NAPI instance.

Later, after NAPI is scheduled, the driver's poll method will be
called to process the events/packets. The method takes a ``budget``
argument: drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.

In other words, for Rx processing the ``budget`` argument limits how many
packets the driver can process in a single poll. Rx-specific APIs like page
pool or XDP cannot be used at all when ``budget`` is 0.
skb Tx processing should happen regardless of the ``budget``.

.. warning::

   The ``budget`` argument may be 0 if the core tries to only process
   skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
the poll method should return exactly ``budget``. In that case,
the NAPI instance will be serviced/polled again (without the
need to be scheduled).

If event processing has been completed (all outstanding packets
processed) the poll method should call napi_complete_done()
before returning. napi_complete_done() releases the ownership
of the instance.

.. warning::

   The case of finishing all events and using exactly ``budget``
   must be handled carefully. There is no way to report this
   (rare) condition to the stack, so the driver must either
   not call napi_complete_done() and wait to be called again,
   or return ``budget - 1``.

   If the ``budget`` is 0 napi_complete_done() should never be called.
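
The return-value rules above can be sketched as a plain userspace model.
It is not a real driver: ``struct toy_queue`` and ``toy_poll()`` are
hypothetical, and the ``completed`` flag merely marks the point where a driver
would call napi_complete_done(). The sketch picks the first of the two options
from the warning, that is, it does not complete when exactly ``budget``
packets were processed.

.. code-block:: c

  #include <stdbool.h>

  /* Hypothetical stand-in for a driver queue; not a kernel structure. */
  struct toy_queue {
      int rx_pending;  /* Rx packets waiting to be processed */
      bool completed;  /* set where a driver would call napi_complete_done() */
  };

  static int toy_poll(struct toy_queue *q, int budget)
  {
      int work_done = 0;

      /* Tx completions would be handled here, regardless of budget. */

      while (work_done < budget && q->rx_pending > 0) {
          q->rx_pending--;
          work_done++;
      }

      /*
       * Budget used up (this includes budget == 0): report the full
       * budget and do not complete, so the instance is polled again.
       */
      if (work_done == budget)
          return budget;

      /* Less than budget was used, so all outstanding Rx work is done. */
      q->completed = true;
      return work_done;
  }

With five pending packets, ``toy_poll(&q, 3)`` followed by ``toy_poll(&q, 2)``
drains the queue without completing, and only the next call, which finds no
work, marks the instance as complete.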

Call sequence
-------------

Drivers should not make assumptions about the exact sequencing
of calls. The poll method may be called without the driver scheduling
the instance (unless the instance is disabled). Similarly,
it's not guaranteed that the poll method will be called, even
if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
calls to the poll method only wait for the ownership of the instance
to be released, not for the poll method to exit. This means that
drivers should avoid accessing any data structures after calling
napi_complete_done().

.. _drv_sched:

Scheduling and IRQ masking
--------------------------

Drivers should keep the interrupts masked after scheduling
the NAPI instance - until NAPI polling finishes any further
interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed
to IRQ being auto-masked by the device) should use the napi_schedule_prep()
and __napi_schedule() calls:

.. code-block:: c

  if (napi_schedule_prep(&v->napi)) {
      mydrv_mask_rxtx_irq(v->idx);
      /* schedule after masking to avoid races */
      __napi_schedule(&v->napi);
  }

IRQ should only be unmasked after a successful call to napi_complete_done():

.. code-block:: c

  if (budget && napi_complete_done(&v->napi, work_done)) {
      mydrv_unmask_rxtx_irq(v->idx);
      return min(work_done, budget - 1);
  }

napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).

Instance to queue mapping
-------------------------

Modern devices have multiple NAPI instances (struct napi_struct) per
interface. There is no strong requirement on how the instances are
mapped to queues and interrupts. NAPI is primarily a polling/processing
abstraction without specific user-facing semantics. That said, most networking
devices end up using NAPI in fairly similar ways.

NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
(queue pair is a set of a single Rx and single Tx queue).

In less common cases a NAPI instance may be used for multiple queues
or Rx and Tx queues can be serviced by separate NAPI instances on a single
core. Regardless of the queue assignment, however, there is usually still
a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses a "channel" terminology where
each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
what constitutes a channel; the recommended interpretation is to understand
a channel as an IRQ/NAPI which services queues of a given type. For example,
a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
to utilize 3 interrupts, 2 Rx and 2 Tx queues.
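
The arithmetic behind the recommended interpretation can be written down as a
short sketch. The types and the function name are hypothetical and only
restate the example above: every channel uses one interrupt, and a
``combined`` channel contributes one Rx and one Tx queue.

.. code-block:: c

  /* Hypothetical helper types; the fields mirror ethtool's channel kinds. */
  struct channel_cfg {
      unsigned int rx;
      unsigned int tx;
      unsigned int combined;
  };

  struct channel_usage {
      unsigned int irqs;
      unsigned int rx_queues;
      unsigned int tx_queues;
  };

  /* One IRQ/NAPI per channel; combined channels count towards both sides. */
  static struct channel_usage count_channel_usage(struct channel_cfg c)
  {
      struct channel_usage u = {
          .irqs = c.rx + c.tx + c.combined,
          .rx_queues = c.rx + c.combined,
          .tx_queues = c.tx + c.combined,
      };

      return u;
  }

For 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel this yields 3 interrupts,
2 Rx queues and 2 Tx queues, matching the example in the text.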

Persistent NAPI config
----------------------

Drivers often allocate and free NAPI instances dynamically. This leads to loss
of NAPI-related user configuration each time NAPI instances are reallocated.
The netif_napi_add_config() API prevents this loss of configuration by
associating each NAPI instance with a persistent NAPI configuration based on
a driver defined index value, like a queue number.

Using this API allows for persistent NAPI IDs (among other settings), which can
be beneficial to userspace programs using ``SO_INCOMING_NAPI_ID``. See the
sections below for other NAPI configuration settings.

Drivers should try to use netif_napi_add_config() whenever possible.

User API
========

User interactions with NAPI depend on NAPI instance ID. The instance IDs
are visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option
and the netdev Netlink API.
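
As a minimal sketch, the socket option can be read with ``getsockopt()`` at
the ``SOL_SOCKET`` level. The helper name is hypothetical, and the fallback
value for ``SO_INCOMING_NAPI_ID`` is an assumption taken from the generic
architecture headers; it is only used if the system headers do not provide the
constant.

.. code-block:: c

  #include <sys/socket.h>

  /* Fallback for older headers; value assumed from asm-generic/socket.h. */
  #ifndef SO_INCOMING_NAPI_ID
  #define SO_INCOMING_NAPI_ID 56
  #endif

  /*
   * Stores the NAPI ID associated with the socket in *napi_id.
   * Returns 0 on success, -1 on error (errno is set by getsockopt()).
   */
  static int get_incoming_napi_id(int fd, unsigned int *napi_id)
  {
      socklen_t len = sizeof(*napi_id);

      return getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len);
  }

A server would typically call this on a freshly accepted connection and use
the returned ID to pick the thread that handles the socket.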

Users can query NAPI IDs for a device or device queue using netlink. This can
be done programmatically in a user application or by using a script included in
the kernel source tree: ``tools/net/ynl/pyynl/cli.py``.

For example, using the script to dump all of the queues for a device (which
will reveal each queue's NAPI ID):

.. code-block:: bash

   $ kernel-source/tools/net/ynl/pyynl/cli.py \
             --spec Documentation/netlink/specs/netdev.yaml \
             --dump queue-get \
             --json='{"ifindex": 2}'

See ``Documentation/netlink/specs/netdev.yaml`` for more details on
available operations and attributes.

Software IRQ coalescing
-----------------------

NAPI does not perform any explicit event coalescing by default.
In most scenarios batching happens due to IRQ coalescing which is done
by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking
the hardware interrupts as soon as all packets are processed.
The ``gro_flush_timeout`` sysfs configuration of the netdevice
is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.
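
The two sysfs attributes live in the netdevice's directory under
``/sys/class/net``. The following sketch writes a value to one of them; the
helper names are hypothetical, and the interface name and values in the usage
note are placeholders rather than recommendations.

.. code-block:: c

  #include <stdio.h>

  /* Builds "/sys/class/net/<ifname>/<attr>"; returns -1 if it does not fit. */
  static int napi_sysfs_path(char *buf, size_t len,
                             const char *ifname, const char *attr)
  {
      int n = snprintf(buf, len, "/sys/class/net/%s/%s", ifname, attr);

      return (n < 0 || (size_t)n >= len) ? -1 : 0;
  }

  /* Writes a decimal value to the attribute; returns 0 on success. */
  static int napi_sysfs_write(const char *ifname, const char *attr,
                              unsigned long val)
  {
      char path[256];
      FILE *f;
      int ok;

      if (napi_sysfs_path(path, sizeof(path), ifname, attr) < 0)
          return -1;

      f = fopen(path, "w");
      if (!f)
          return -1;

      ok = fprintf(f, "%lu\n", val) > 0;
      /* fclose() flushes the buffer, so a rejected write shows up here. */
      if (fclose(f) != 0)
          ok = 0;

      return ok ? 0 : -1;
  }

For example, ``napi_sysfs_write("eth0", "gro_flush_timeout", 20000)`` followed
by ``napi_sysfs_write("eth0", "napi_defer_hard_irqs", 2)`` would configure
both knobs for a device named ``eth0``. Writing these files normally requires
root privileges.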

The above parameters can also be set on a per-NAPI basis using netlink via
netdev-genl. When used with netlink and configured on a per-NAPI basis, the
parameters mentioned above use hyphens instead of underscores:
``gro-flush-timeout`` and ``defer-hard-irqs`` (the netlink attribute drops
the ``napi-`` prefix, as the example below shows).

Per-NAPI configuration can be done programmatically in a user application
or by using a script included in the kernel source tree:
``tools/net/ynl/pyynl/cli.py``.

For example, using the script:

.. code-block:: bash

  $ kernel-source/tools/net/ynl/pyynl/cli.py \
            --spec Documentation/netlink/specs/netdev.yaml \
            --do napi-set \
            --json='{"id": 345,
                     "defer-hard-irqs": 111,
                     "gro-flush-timeout": 11111}'

Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
via netdev-genl. There is no global sysfs parameter for this value.

``irq-suspend-timeout`` is used to determine how long an application can
completely suspend IRQs. It is used in combination with
``SO_PREFER_BUSY_POLL``, which can be set on a per-epoll context basis with
the ``EPIOCSPARAMS`` ioctl.

.. _poll:

Busy polling
------------

Busy polling allows a user process to check for incoming packets before
the device interrupt fires. As is the case with any busy polling it trades
off CPU cycles for lower latency (production uses of NAPI busy polling
are not well known).

Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists. Threaded polling of NAPI also has a mode to busy poll for
packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the NAPI
processing kthread.
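
A minimal sketch of the per-socket variant is shown below. The helper name is
hypothetical. According to socket(7), the option value is an approximate time
in microseconds and raising it requires ``CAP_NET_ADMIN``.

.. code-block:: c

  #include <sys/socket.h>

  /*
   * Asks the kernel to busy poll for up to `usecs` microseconds on
   * blocking receives of this socket. Returns 0 on success, -1 on error.
   */
  static int set_busy_poll(int fd, int usecs)
  {
      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
  }

A caller would typically apply this right after creating or accepting the
socket, for example ``set_busy_poll(fd, 50)``; the value 50 is only a
placeholder.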

epoll-based busy polling
------------------------

It is possible to trigger packet processing directly from calls to
``epoll_wait``. In order to use this feature, a user application must ensure
all file descriptors which are added to an epoll context have the same NAPI ID.

If the application uses a dedicated acceptor thread, the application can obtain
the NAPI ID of the incoming connection using ``SO_INCOMING_NAPI_ID`` and then
distribute that file descriptor to a worker thread. The worker thread would add
the file descriptor to its epoll context. This would ensure each worker thread
has an epoll context with FDs that have the same NAPI ID.

Alternatively, if the application uses ``SO_REUSEPORT``, a BPF or eBPF program
can be inserted to distribute incoming connections to threads such that each
thread is only given incoming connections with the same NAPI ID. Care must be
taken to handle cases where a system may have multiple NICs.

In order to enable busy polling, there are two choices:

1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to
   busy loop waiting for events. This is a system-wide setting and will cause
   all epoll-based applications to busy poll when they call ``epoll_wait``.
   This may not be desirable as many applications may not have the need to
   busy poll.

2. Applications using recent kernels can issue an ioctl on the epoll context
   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``)
   ``struct epoll_params``, which user programs can define as follows:

.. code-block:: c

  #include <stdint.h>

  struct epoll_params {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t prefer_busy_poll;

      /* pad the struct to a multiple of 64bits */
      uint8_t __pad;
  };
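
Issuing the ioctl could look roughly like the sketch below. To avoid clashing
with system headers that may already define ``struct epoll_params``, the
sketch uses its own hypothetical ``my_``-prefixed names. The ioctl numbers are
an assumption intended to match ``EPIOCSPARAMS`` and ``EPIOCGPARAMS`` in
``<linux/eventpoll.h>``; real programs should prefer the definitions from the
system headers when they are available.

.. code-block:: c

  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Local copy of the layout shown above, under a hypothetical name. */
  struct my_epoll_params {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t prefer_busy_poll;
      uint8_t pad; /* padding, left as zero */
  };

  /* Assumed to match EPIOCSPARAMS/EPIOCGPARAMS in <linux/eventpoll.h>. */
  #define MY_EPOLL_IOC_TYPE 0x8A
  #define MY_EPIOCSPARAMS _IOW(MY_EPOLL_IOC_TYPE, 0x01, struct my_epoll_params)
  #define MY_EPIOCGPARAMS _IOR(MY_EPOLL_IOC_TYPE, 0x02, struct my_epoll_params)

  /* Sets busy poll parameters on an epoll context; 0 on success, -1 on error. */
  static int set_epoll_busy_poll(int epfd, uint32_t usecs, uint16_t budget,
                                 int prefer)
  {
      struct my_epoll_params p = {
          .busy_poll_usecs = usecs,
          .busy_poll_budget = budget,
          .prefer_busy_poll = prefer ? 1 : 0,
      };

      return ioctl(epfd, MY_EPIOCSPARAMS, &p);
  }

  /* Reads the current parameters back; 0 on success, -1 on error. */
  static int get_epoll_busy_poll(int epfd, struct my_epoll_params *out)
  {
      return ioctl(epfd, MY_EPIOCGPARAMS, out);
  }

On kernels that do not implement these ioctls the calls fail with -1, so
callers should be prepared to fall back to the system-wide sysctl.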

IRQ mitigation
--------------

While busy polling is supposed to be used by low latency applications,
a similar mechanism can be used for IRQ mitigation.

Very high request-per-second applications (especially routing/forwarding
applications and especially applications using AF_XDP sockets) may not
want to be interrupted until they finish processing a request or a batch
of packets.

Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
busy polling applications, the ``prefer_busy_poll`` field of
``struct epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be
issued to enable this mode. See the above section for more details.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
applications, the ``busy_poll_budget`` field can be adjusted to the desired
value in ``struct epoll_params`` and set on a specific epoll context using the
``EPIOCSPARAMS`` ioctl. See the above section for more details.
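
For plain sockets, the two options named above could be applied as in the
following sketch. The helper name is hypothetical, and the fallback values for
the option constants are an assumption taken from the generic architecture
headers; they are only used if the system headers lack the definitions.

.. code-block:: c

  #include <sys/socket.h>

  /* Fallbacks for older headers; values assumed from asm-generic/socket.h. */
  #ifndef SO_PREFER_BUSY_POLL
  #define SO_PREFER_BUSY_POLL 69
  #endif
  #ifndef SO_BUSY_POLL_BUDGET
  #define SO_BUSY_POLL_BUDGET 70
  #endif

  /*
   * Makes the busy polling pledge for this socket and sets the NAPI
   * budget used while busy polling. Returns 0 on success, -1 on error.
   */
  static int enable_irq_mitigation(int fd, int budget)
  {
      int one = 1;

      if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                     &one, sizeof(one)) < 0)
          return -1;

      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                        &budget, sizeof(budget));
  }

The budget passed in is application specific; the sketch does not suggest a
particular value.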

It is important to note that choosing a large value for ``gro_flush_timeout``
will defer IRQs to allow for better batch processing, but will induce latency
when the system is not fully loaded. Choosing a small value for
``gro_flush_timeout`` can cause interference of the user application which is
attempting to busy poll by device IRQs and softirq processing. This value
should be chosen carefully with these tradeoffs in mind. epoll-based busy
polling applications may be able to mitigate how much user processing happens
by choosing an appropriate value for ``maxevents``.

Users may want to consider an alternate approach, IRQ suspension, to help deal
with these tradeoffs.

IRQ suspension
--------------

IRQ suspension is a mechanism wherein device IRQs are masked while epoll
triggers NAPI packet processing.

While application calls to ``epoll_wait`` successfully retrieve events, the
kernel will defer the IRQ suspension timer. If the kernel does not retrieve any
events while busy polling (for example, because network traffic levels
subsided), IRQ suspension is disabled and the IRQ mitigation strategies
described above are engaged.

This allows users to balance CPU consumption with network processing
efficiency.

To use this mechanism:

  1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
     maximum time (in nanoseconds) the application can have its IRQs
     suspended. This is done using netlink, as described above. This timeout
     serves as a safety mechanism to restart driver interrupt processing if
     the application has stalled. This value should be chosen so that it covers
     the amount of time the user application needs to process data from its
     call to ``epoll_wait``, noting that applications can control how much data
     they retrieve by setting ``maxevents`` when calling ``epoll_wait``.

  2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
     and ``napi_defer_hard_irqs`` can be set to low values. They will be used
     to defer IRQs after busy poll has found no data.

  3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
     the ``EPIOCSPARAMS`` ioctl as described above.

  4. The application uses epoll as described above to trigger NAPI packet
     processing.

As mentioned above, as long as subsequent calls to ``epoll_wait`` return events
to userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled.
This allows the application to process data without interference.

Once a call to ``epoll_wait`` results in no events being found, IRQ suspension
is automatically disabled and the ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` mitigation mechanisms take over.

It is expected that ``irq-suspend-timeout`` will be set to a value much larger
than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
the duration of one userland processing cycle.

While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
``gro_flush_timeout`` to use IRQ suspension, their use is strongly
recommended.

IRQ suspension causes the system to alternate between polling mode and
irq-driven packet delivery. During busy periods, ``irq-suspend-timeout``
overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
epoll finds no events, the setting of ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` determine the next step.

There are essentially three possible loops for network processing and
packet delivery:

1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping

Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` are set.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
and 3 "wrestle" with each other for control.

During busy periods, ``irq-suspend-timeout`` is used as timer in Loop 2,
which essentially tilts network processing in favour of Loop 3.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
cannot take control from Loop 1.

Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.
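
The interplay of the three loops can be summarized in a deliberately
simplified decision function. This is a reading aid and not the kernel's
logic: all names are hypothetical, the application is assumed to busy poll via
epoll with ``prefer_busy_poll`` set, and the gradual countdown of
``napi_defer_hard_irqs`` is not modelled.

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical view of the per-NAPI settings; 0 means "not set". */
  struct toy_napi_cfg {
      uint64_t gro_flush_timeout;
      uint32_t defer_hard_irqs;
      uint64_t irq_suspend_timeout;
  };

  /* Values follow the loop numbering used in the text. */
  enum toy_loop {
      TOY_LOOP_HARDIRQ = 1,
      TOY_LOOP_TIMER = 2,
      TOY_LOOP_BUSY_POLL = 3,
  };

  /* Which loop is expected to drive processing next. */
  static enum toy_loop toy_next_loop(const struct toy_napi_cfg *c,
                                     bool epoll_found_events)
  {
      bool deferral = c->gro_flush_timeout && c->defer_hard_irqs;

      /* Without deferral, Loop 3 cannot take control from Loop 1. */
      if (!deferral)
          return TOY_LOOP_HARDIRQ;

      /* Busy period: the suspend timeout keeps Loop 3 in control. */
      if (epoll_found_events && c->irq_suspend_timeout)
          return TOY_LOOP_BUSY_POLL;

      /* No events (or no suspend timeout): deferral takes over. */
      return TOY_LOOP_TIMER;
  }

The first branch reflects why setting only ``irq-suspend-timeout`` might not
have any discernible effect.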

.. _threaded_busy_poll:

Threaded NAPI busy polling
--------------------------

Threaded NAPI busy polling extends threaded NAPI and adds support to do
continuous busy polling of the NAPI. This can be useful for forwarding or
AF_XDP applications.

Threaded NAPI busy polling can be enabled on a per-NIC-queue basis using
Netlink.

For example, using the following script:

.. code-block:: bash

  $ ynl --family netdev --do napi-set \
            --json='{"id": 66, "threaded": "busy-poll"}'

The kernel will create a kthread that busy polls on this NAPI.

The user may elect to set the CPU affinity of this kthread to an unused CPU
core to improve how often the NAPI is polled at the expense of wasted CPU
cycles. Note that this will keep the CPU core busy with 100% usage.

Once threaded busy polling is enabled for a NAPI, the PID of the kthread can be
retrieved using Netlink so the affinity of the kthread can be set up.

For example, the following script can be used to fetch the PID:

.. code-block:: bash

  $ ynl --family netdev --do napi-get --json='{"id": 66}'

This will output something like the following, where ``258`` is the PID of the
kthread that is polling this NAPI.

.. code-block:: text

  {'defer-hard-irqs': 0,
   'gro-flush-timeout': 0,
   'id': 66,
   'ifindex': 2,
   'irq-suspend-timeout': 0,
   'pid': 258,
   'threaded': 'busy-poll'}
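
Once the PID is known, the affinity can be set with the usual tools or
programmatically. The following is a minimal sketch using
``sched_setaffinity()``; the helper name is hypothetical, and changing the
affinity of a kernel thread normally requires appropriate privileges.

.. code-block:: c

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE /* for CPU_SET() and sched_setaffinity() */
  #endif
  #include <sched.h>
  #include <sys/types.h>

  /*
   * Pins the task identified by pid (0 means the calling thread) to a
   * single CPU. Returns 0 on success, -1 on error.
   */
  static int pin_task_to_cpu(pid_t pid, int cpu)
  {
      cpu_set_t set;

      if (cpu < 0 || cpu >= CPU_SETSIZE)
          return -1;

      CPU_ZERO(&set);
      CPU_SET(cpu, &set);

      return sched_setaffinity(pid, sizeof(set), &set);
  }

With the output above, ``pin_task_to_cpu(258, 3)`` would pin the polling
kthread to CPU 3; the CPU number is only a placeholder.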

.. _threaded:

Threaded NAPI
-------------

Threaded NAPI is an operating mode that uses dedicated kernel
threads rather than software IRQ context for NAPI processing.
Each threaded NAPI instance will spawn a separate thread
(called ``napi/${ifc-name}-${napi-id}``).

It is recommended to pin each kernel thread to a single CPU, the same
CPU as the CPU which services the interrupt. Note that the mapping
between IRQs and NAPI instances may not be trivial (and is driver
dependent). The NAPI instance IDs will be assigned in the opposite
order to the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
netdev's sysfs directory. It can also be enabled for a specific NAPI using
the netlink interface.

For example, using the script:

.. code-block:: bash

  $ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}'

.. rubric:: Footnotes

.. [#] NAPI was originally referred to as New API in Linux 2.4.
