.. SPDX-License-Identifier: GPL-2.0

============================================================
Linux kernel driver for Elastic Network Adapter (ENA) family
============================================================

Overview
========

ENA is a networking interface designed to make good use of modern CPU
features and system architectures.

The ENA device exposes a lightweight management interface with a
minimal set of memory-mapped registers and an extensible command set
through an Admin Queue.

The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extensible feature set.

Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.

ENA devices enable high-speed and low-overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.

The ENA driver supports industry standard TCP/IP offload features such as
checksum offload. Receive-side scaling (RSS) is supported for multi-core
scaling.

The ENA driver and its corresponding devices implement health
monitoring mechanisms, such as a watchdog that enables the device and
driver to recover in a manner transparent to the application, as well
as debug logs.

Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which reduces latency by several microseconds.

ENA Source Code Directory Structure
===================================

=================   ======================================================
ena_com.[ch]        Management communication layer. This layer is
                    responsible for handling all the management
                    (admin) communication between the device and the
                    driver.
ena_eth_com.[ch]    Tx/Rx data path.
ena_admin_defs.h    Definition of ENA management interface.
ena_eth_io_defs.h   Definition of ENA data path interface.
ena_common_defs.h   Common definitions for ena_com layer.
ena_regs_defs.h     Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch]     Main Linux kernel driver.
ena_ethtool.c       ethtool callbacks.
ena_xdp.[ch]        XDP files.
ena_pci_id_tbl.h    Supported device IDs.
=================   ======================================================

Management Interface
====================

The ENA management interface is exposed by means of:

- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)

ENA device MMIO Registers are accessed only during driver
initialization and are not used during further normal device
operation.

AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.

ENA introduces a small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.

The following admin queue commands are supported:

- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics

Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.

The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.

The events are:

====================    ===============
Group                   Syndrome
====================    ===============
Link state change       **X**
Fatal error             **X**
Notification            Suspend traffic
Notification            Resume traffic
Keep-Alive              **X**
====================    ===============

ACQ and AENQ share the same MSI-X vector.

Keep-Alive is a special mechanism that allows monitoring the device's health.
A Keep-Alive event is delivered by the device every second.
The driver maintains a watchdog (WD) handler which logs the current state and
statistics. If the keep-alive events are not delivered as expected, the WD
resets the device and the driver.
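
The keep-alive check described above can be modelled in a few lines. The
following is a minimal userspace sketch in Python, not driver code: the class
name ``KeepAliveWatchdog``, its methods and the 6-second timeout are
assumptions chosen for illustration. Only the one-second event period comes
from the text above.

.. code-block:: python

   class KeepAliveWatchdog:
       """Toy model of a keep-alive watchdog (hypothetical, not the driver's code)."""

       def __init__(self, timeout_s: float = 6.0):
           # Assumed timeout; the device is expected to send an event every second.
           self.timeout_s = timeout_s
           self.last_keep_alive = 0.0
           self.resets = 0

       def on_keep_alive(self, now: float) -> None:
           """Record the arrival time of a Keep-Alive AENQ event."""
           self.last_keep_alive = now

       def tick(self, now: float) -> bool:
           """Periodic check; returns True if a reset was triggered."""
           if now - self.last_keep_alive > self.timeout_s:
               self.resets += 1
               self.last_keep_alive = now  # the timer restarts after the reset
               return True
           return False

In this model a reset is triggered only when no event has been seen for longer
than the timeout, so a few delayed events do not cause a reset.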

Data Path Interface
===================

I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ, respectively). Each SQ has a completion queue (CQ) associated
with it.

The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.

The ENA driver supports two Queue Operation modes for Tx SQs:

- **Regular mode:**
  In this mode the Tx SQs reside in the host's memory. The ENA
  device fetches the ENA Tx descriptors and packet data from host
  memory.

- **Low Latency Queue (LLQ) mode or "push-mode":**
  In this mode the driver pushes the transmit descriptors and the
  first 96 bytes of the packet directly to the ENA device memory
  space. The rest of the packet payload is fetched by the
  device. For this operation mode, the driver uses a dedicated PCI
  device memory BAR, which is mapped with write-combine capability.

  **Note that** not all ENA devices support LLQ, and this feature is negotiated
  with the device upon initialization. If the ENA device does not
  support LLQ mode, the driver falls back to the regular mode.
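
As a rough illustration of the LLQ split, the sketch below divides a packet
into the part that would be pushed to device memory and the part the device
would fetch from host memory. It is a Python model only: the function name
``split_for_llq`` is hypothetical, and the 96-byte figure is taken from the
text above.

.. code-block:: python

   LLQ_PUSH_HEADER_BYTES = 96  # "first 96 bytes of the packet", per the text above


   def split_for_llq(packet: bytes, push_len: int = LLQ_PUSH_HEADER_BYTES):
       """Return (pushed, fetched) for one packet.

       pushed:  bytes written directly to device memory together with the descriptors
       fetched: bytes that stay in host memory and are fetched by the device
       """
       return packet[:push_len], packet[push_len:]

For a 1500-byte packet this yields 96 pushed bytes and 1404 fetched bytes. A
packet shorter than 96 bytes is pushed in full and leaves nothing to fetch.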

The Rx SQs support only the regular mode.

The driver supports multi-queue for both Tx and Rx. This has various
benefits:

- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
  cache lines that hold the sk_buff structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU where the application thread consuming the
  packet is running.
- In-hardware interrupt redirection.

Interrupt Modes
===============

The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).

Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.

The management interrupt is named::

   ena-mgmnt@pci:<PCI domain:bus:slot.function>

and for each queue pair, an interrupt is named::

   <interface name>-Tx-Rx-<queue index>

The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once an MSI-X interrupt is delivered to the host, its
Cause bit is automatically cleared and the interrupt is masked. The
interrupt is unmasked by the driver after NAPI processing is complete.

Interrupt Moderation
====================

The ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.

**In conventional mode** the driver instructs the device to postpone interrupt
posting according to a static interrupt delay value. The interrupt delay
value can be configured through `ethtool(8)`. The following `ethtool`
parameters are supported by the driver: ``tx-usecs`` and ``rx-usecs``.

**In adaptive interrupt moderation mode** the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the nature of the traffic.

Adaptive coalescing can be switched on/off through `ethtool(8)`'s
:code:`adaptive-rx on|off` parameter.

More information about Dynamic Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst
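
The coalescing settings above map onto ``ethtool -C`` arguments. The sketch
below builds such a command line from Python; the helper names
``coalesce_cmd`` and ``apply_cmd`` and the interface name ``eth0`` are
assumptions for illustration. Note that `ethtool(8)` spells the adaptive
option ``adaptive-rx``.

.. code-block:: python

   import subprocess


   def coalesce_cmd(ifname, adaptive_rx=None, rx_usecs=None, tx_usecs=None):
       """Build an ``ethtool -C`` command line; settings left as None are omitted."""
       cmd = ["ethtool", "-C", ifname]
       if adaptive_rx is not None:
           cmd += ["adaptive-rx", "on" if adaptive_rx else "off"]
       if rx_usecs is not None:
           cmd += ["rx-usecs", str(rx_usecs)]
       if tx_usecs is not None:
           cmd += ["tx-usecs", str(tx_usecs)]
       return cmd


   def apply_cmd(cmd):
       """Run the command; this needs root privileges and a real interface."""
       subprocess.run(cmd, check=True)

For example, ``coalesce_cmd("eth0", adaptive_rx=False, rx_usecs=64)`` selects
conventional mode with a static Rx delay; the value 64 is arbitrary.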

.. _`RX copybreak`:

RX copybreak
============

The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.

This option controls the maximum packet length for which the RX
descriptor it was received on would be recycled. When a packet smaller
than RX copybreak bytes is received, it is copied into a new memory
buffer and the RX descriptor is returned to HW.

Statistics
==========

The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.

In addition, the driver logs the stats to syslog upon device reset.

On supported instance types, the statistics will also include the
ENA Express data (fields prefixed with `ena_srd`). For complete
documentation of the ENA Express data, refer to
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html#ena-express-monitor
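
A small helper can pick the ENA Express fields out of ``ethtool -S`` output.
The sketch below assumes the usual ``name: value`` line layout of that
output. The function names are hypothetical, and the counter values in the
usage note are made up for illustration.

.. code-block:: python

   def parse_ethtool_stats(text: str) -> dict:
       """Parse ``ethtool -S`` style text into {name: int}, skipping other lines."""
       stats = {}
       for line in text.splitlines():
           name, sep, value = line.rpartition(":")
           value = value.strip()
           # Keep only "name: <integer>" lines; the header line has no value.
           if sep and value.lstrip("-").isdigit():
               stats[name.strip()] = int(value)
       return stats


   def ena_express_stats(stats: dict) -> dict:
       """Keep only the fields prefixed with ``ena_srd``."""
       return {k: v for k, v in stats.items() if k.startswith("ena_srd")}

Feeding the helper the text ``"NIC statistics:\n     tx_timeout: 0\n
ena_srd_tx_pkts: 120\n"`` would keep only the ``ena_srd_tx_pkts`` entry.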

MTU
===

The driver supports any MTU up to a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via `ip(8)` or similar tools.

Stateless Offloads
==================

The ENA driver supports:

- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads

RSS
===

- The ENA device supports RSS that allows flexible Rx traffic
  steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
  inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
  (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
  ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
  function delivered in the Rx CQ descriptor is set in the received
  SKB.
- The user can provide a hash key, hash function, and configure the
  indirection table through `ethtool(8)`.
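
For readers unfamiliar with the Toeplitz function named above, the following
Python sketch implements the generic textbook definition: for every set bit
of the input, the 32-bit window of the key starting at that bit position is
XORed into the result. It is not taken from the driver or the device, and this
document does not specify the key or the exact layout of the input fields ENA
hashes, so both are left to the caller.

.. code-block:: python

   def toeplitz_hash(key: bytes, data: bytes) -> int:
       """Return the 32-bit Toeplitz hash of ``data`` under ``key``."""
       key_bits = len(key) * 8
       data_bits = len(data) * 8
       # Every input bit needs a full 32-bit window of key material.
       if key_bits < data_bits + 31:
           raise ValueError("key too short for this input")
       key_int = int.from_bytes(key, "big")
       result = 0
       for i in range(data_bits):
           # Bits are consumed most-significant first.
           if data[i // 8] & (0x80 >> (i % 8)):
               result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
       return result

The function is linear over XOR: hashing ``a ^ b`` gives the XOR of the hashes
of ``a`` and ``b``. An input with only its first bit set hashes to the first
32 bits of the key.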

DATA PATH
=========

Tx
--

:code:`ena_start_xmit()` is called by the stack. This function does the following:

- Maps data buffers (``skb->data`` and frags).
- Populates ``ena_buf`` for the push buffer (if the driver and device are
  in push mode).
- Prepares ENA bufs for the remaining frags.
- Allocates a new request ID from the empty ``req_id`` ring. The request
  ID is the index of the packet in the Tx info. This is used for
  out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer function that
  converts the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors
  as needed).

  * This function also copies the ENA descriptors and the push buffer
    to the Device memory space (if in push mode).

- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
  interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_tx_irq()` function is called. This function handles the
  completion descriptors generated by the ENA, with a single
  completion descriptor per completed packet.

  * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
    the packet is retrieved via the ``req_id``. The data buffers are
    unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
  * The function stops when the completion descriptors are completed or
    the budget is reached.
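
The ``req_id`` bookkeeping in the steps above can be sketched as a small
Python model. The class ``TxRing`` and its method names are hypothetical and
only mirror the roles of :code:`ena_start_xmit()` and
:code:`ena_clean_tx_irq()`. DMA mapping, descriptors and doorbells are left
out.

.. code-block:: python

   from collections import deque


   class TxRing:
       """Toy model of req_id handling for out-of-order Tx completions."""

       def __init__(self, size: int):
           self.free_req_ids = deque(range(size))  # the "empty req_id ring"
           self.tx_info = [None] * size            # per-packet info, indexed by req_id

       def start_xmit(self, skb):
           """Take a free req_id and remember the packet; None if the ring is full."""
           if not self.free_req_ids:
               return None
           req_id = self.free_req_ids.popleft()
           self.tx_info[req_id] = skb
           return req_id

       def clean_tx(self, completed_req_ids, budget: int):
           """Handle up to ``budget`` completions; returns the packets released."""
           done = []
           for req_id in completed_req_ids[:budget]:
               done.append(self.tx_info[req_id])
               self.tx_info[req_id] = None
               self.free_req_ids.append(req_id)  # req_id becomes reusable
           return done

Because each completion carries its own ``req_id``, completions may arrive in
any order and the matching packet is still found by index.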

Rx
--

- When a packet is received from the ENA device, an interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_rx_irq()` function is called. This function calls
  :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
  number of descriptors used for a new packet, or zero if
  no new packet is found.
- :code:`ena_rx_skb()` checks packet length:

  * If the packet is small (len < rx_copybreak), the driver allocates
    a SKB for the new packet, and copies the packet payload into the
    SKB data buffer.

    - In this way the original data buffer is not passed to the stack
      and is reused for future Rx packets.

  * Otherwise, the function unmaps the Rx buffer, sets the first
    descriptor as `skb`'s linear part and the other descriptors as the
    `skb`'s frags.

- The new SKB is updated with the necessary information (protocol,
  checksum hw verify result, etc.), and then passed to the network
  stack, using the NAPI interface function :code:`napi_gro_receive()`.
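
The copy-or-attach decision made by :code:`ena_rx_skb()` can be illustrated
with a short Python model. Everything here is a simplification: the function
name ``ena_rx_skb_model`` is hypothetical, an skb is represented as a plain
dictionary, and a small packet is assumed to fit in the first Rx buffer.

.. code-block:: python

   def ena_rx_skb_model(rx_buffers, pkt_len, rx_copybreak):
       """Build a toy skb from the Rx buffers of one packet.

       rx_buffers: list of bytes objects, one per Rx descriptor used by the packet.
       Returns (skb, first_buffer_reusable).
       """
       if pkt_len < rx_copybreak:
           # Small packet: copy the payload into a freshly allocated linear area,
           # so the original Rx buffer can be handed back to the device.
           skb = {"linear": bytes(rx_buffers[0][:pkt_len]), "frags": []}
           return skb, True
       # Larger packet: pass the buffers themselves to the stack, without a copy.
       skb = {"linear": rx_buffers[0], "frags": list(rx_buffers[1:])}
       return skb, False

The trade-off is one small copy against the cost of allocating and mapping a
replacement Rx buffer.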

Dynamic RX Buffers (DRB)
------------------------

Each RX descriptor in the RX ring is a single memory page (which is either 4KB
or 16KB long depending on the system's configuration).
To reduce the memory allocations required when dealing with a high rate of small
packets, the driver tries to reuse the remaining RX descriptor's space if more
than 2KB of this page remains unused.

A simple example of this mechanism is the following sequence of events:

::

        1. Driver allocates page-sized RX buffer and passes it to hardware
                +----------------------+
                |4KB RX Buffer         |
                +----------------------+

        2. A 300 Bytes packet is received on this buffer

        3. The driver increases the ref count on this page and returns it back to
           HW as an RX buffer of size 4KB - 300 Bytes = 3796 Bytes
               +----+--------------------+
               |****|3796 Bytes RX Buffer|
               +----+--------------------+

This mechanism is not used when an XDP program is loaded, or when the
RX packet is less than rx_copybreak bytes (in which case the packet is
copied out of the RX buffer into the linear part of a new skb allocated
for it and the RX buffer remains the same size, see `RX copybreak`_).
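
The reuse rule can be written down as a small Python function. This is a
sketch of the rule as stated in this section only: the name
``drb_next_buffer`` is hypothetical, any offset alignment the driver may apply
is ignored, and checking rx_copybreak before the XDP condition is an
assumption.

.. code-block:: python

   REUSE_THRESHOLD = 2048  # "more than 2KB" of the page must remain unused


   def drb_next_buffer(page_size, offset, pkt_len, xdp_loaded=False, rx_copybreak=0):
       """Return (offset, size) of the buffer handed back to HW, or None.

       None means the rest of the page is not reused and a new page is needed.
       """
       if pkt_len < rx_copybreak:
           # The packet is copied out, so the buffer keeps its offset and size.
           return offset, page_size - offset
       if xdp_loaded:
           return None
       new_offset = offset + pkt_len
       remaining = page_size - new_offset
       if remaining > REUSE_THRESHOLD:
           return new_offset, remaining
       return None

With a 4096-byte page and a 300-byte packet the function returns
``(300, 3796)``, matching the example above.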