.. SPDX-License-Identifier: GPL-2.0

============================================================
Linux kernel driver for Elastic Network Adapter (ENA) family
============================================================

Overview
========

ENA is a networking interface designed to make good use of modern CPU
features and system architectures.

The ENA device exposes a lightweight management interface with a
minimal set of memory-mapped registers and an extensible command set
through an Admin Queue.

The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extensible feature set.

Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.

ENA devices enable high speed and low overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.

The ENA driver supports industry standard TCP/IP offload features such as
checksum offload.
Receive-side scaling (RSS) is supported for multi-core
scaling.

The ENA driver and its corresponding devices implement health
monitoring mechanisms such as a watchdog, enabling the device and driver
to recover in a manner transparent to the application, as well as
debug logs.

Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds of latency.

ENA Source Code Directory Structure
===================================

=================   ======================================================
ena_com.[ch]        Management communication layer. This layer is
                    responsible for handling all the management
                    (admin) communication between the device and the
                    driver.
ena_eth_com.[ch]    Tx/Rx data path.
ena_admin_defs.h    Definition of ENA management interface.
ena_eth_io_defs.h   Definition of ENA data path interface.
ena_common_defs.h   Common definitions for ena_com layer.
ena_regs_defs.h     Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch]     Main Linux kernel driver.
ena_ethtool.c       ethtool callbacks.
ena_xdp.[ch]        XDP files.
ena_pci_id_tbl.h    Supported device IDs.
=================   ======================================================

Management Interface:
=====================

ENA management interface is exposed by means of:

- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)

ENA device MMIO Registers are accessed only during driver
initialization and are not used during further normal device
operation.

AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.

ENA introduces a small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.

The following admin queue commands are supported:

- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics

Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.

The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.

The events are:

====================    ===============
Group                   Syndrome
====================    ===============
Link state change       **X**
Fatal error             **X**
Notification            Suspend traffic
Notification            Resume traffic
Keep-Alive              **X**
====================    ===============

ACQ and AENQ share the same MSI-X vector.

Keep-Alive is a special mechanism that allows monitoring the device's health.
A Keep-Alive event is delivered by the device every second.
The driver maintains a watchdog (WD) handler which logs the current state and
statistics. If the keep-alive events aren't delivered as expected, the WD resets
the device and the driver.

Data Path Interface
===================

I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ respectively). Each SQ has a completion queue (CQ) associated
with it.

The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.

The ENA driver supports two Queue Operation modes for Tx SQs:

- **Regular mode:**
  In this mode the Tx SQs reside in the host's memory. The ENA
  device fetches the ENA Tx descriptors and packet data from host
  memory.

- **Low Latency Queue (LLQ) mode or "push-mode":**
  In this mode the driver pushes the transmit descriptors and the
  first 96 bytes of the packet directly to the ENA device memory
  space. The rest of the packet payload is fetched by the
  device. For this operation mode, the driver uses a dedicated PCI
  device memory BAR, which is mapped with write-combine capability.

  **Note that** not all ENA devices support LLQ, and this feature is negotiated
  with the device upon initialization. If the ENA device does not
  support LLQ mode, the driver falls back to the regular mode.

The Rx SQs support only the regular mode.

The driver supports multi-queue for both Tx and Rx. This has various
benefits:

- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
  cache lines that hold the sk_buff structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU where the application thread consuming the
  packet is running.
- In-hardware interrupt re-direction.

Interrupt Modes
===============

The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).

Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.

The management interrupt is named::

   ena-mgmnt@pci:<PCI domain:bus:slot.function>

and for each queue pair, an interrupt is named::

   <interface name>-Tx-Rx-<queue index>

The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once an MSI-X interrupt is delivered to the host, its Cause bit is
automatically cleared and the interrupt is masked.
The interrupt is unmasked by the driver after NAPI processing is complete.

Interrupt Moderation
====================

ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.

**In conventional mode** the driver instructs the device to postpone interrupt
posting according to a static interrupt delay value. The interrupt delay
value can be configured through `ethtool(8)`. The following `ethtool`
parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``.

**In adaptive interrupt** moderation mode the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.

Adaptive coalescing can be switched on/off through `ethtool(8)`'s
:code:`adaptive-rx on|off` parameter.

More information about Dynamic Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst

.. _`RX copybreak`:

RX copybreak
============

The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.

This option controls the maximum packet length for which the RX
descriptor it was received on would be recycled. When a packet smaller
than RX copybreak bytes is received, it is copied into a new memory
buffer and the RX descriptor is returned to HW.

Statistics
==========

The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.

In addition, the driver logs the stats to syslog upon device reset.

On supported instance types, the statistics will also include the
ENA Express data (fields prefixed with `ena_srd`). For complete
documentation of ENA Express data refer to
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html#ena-express-monitor

MTU
===

The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via `ip(8)` and similar legacy tools.

Stateless Offloads
==================

The ENA driver supports:

- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads

RSS
===

- The ENA device supports RSS that allows flexible Rx traffic
  steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
  inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
  (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
  ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
  function delivered in the Rx CQ descriptor is set in the received
  SKB.
- The user can provide a hash key, hash function, and configure the
  indirection table through `ethtool(8)`.

DATA PATH
=========

Tx
--

:code:`ena_start_xmit()` is called by the stack. This function does the following:

- Maps data buffers (``skb->data`` and frags).
- Populates ``ena_buf`` for the push buffer (if the driver and device are
  in push mode).
- Prepares ENA bufs for the remaining frags.
- Allocates a new request ID from the empty ``req_id`` ring. The request
  ID is the index of the packet in the Tx info. This is used for
  out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer function that
  converts the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
  needed).

  * This function also copies the ENA descriptors and the push buffer
    to the Device memory space (if in push mode).

- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
  interrupt is raised.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_tx_irq()` function is called. This function handles the
  completion descriptors generated by the ENA, with a single
  completion descriptor per completed packet.

  * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
    the packet is retrieved via the ``req_id``. The data buffers are
    unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
  * The function stops when the completion descriptors are completed or
    the budget is reached.
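
The ``req_id`` bookkeeping in the steps above can be illustrated with a
small userspace sketch. This is not the driver's code: the structure and
function names (``tx_ring_sketch``, ``ring_init()``, ``ring_xmit()``,
``ring_complete()``) and the ring depth of 8 are hypothetical. They only
show how a ring of free IDs lets completions arrive in any order while
each ID still indexes the per-packet Tx info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8	/* hypothetical depth; must be a power of two */

struct tx_ring_sketch {
	uint16_t free_ids[RING_SIZE];	/* ring of currently unused req_ids */
	uint16_t next_to_use;		/* slot the next req_id is taken from */
	uint16_t next_to_clean;		/* slot a completed req_id goes back to */
	uint16_t in_flight;		/* packets awaiting completion */
	const void *tx_info[RING_SIZE];	/* per-packet context, indexed by req_id */
};

void ring_init(struct tx_ring_sketch *r)
{
	for (uint16_t i = 0; i < RING_SIZE; i++) {
		r->free_ids[i] = i;
		r->tx_info[i] = NULL;
	}
	r->next_to_use = 0;
	r->next_to_clean = 0;
	r->in_flight = 0;
}

/* Transmit side: take a free req_id and store the packet under it.
 * Returns the req_id, or -1 if every ID is already in flight. */
int ring_xmit(struct tx_ring_sketch *r, const void *pkt)
{
	uint16_t req_id;

	if (r->in_flight == RING_SIZE)
		return -1;

	req_id = r->free_ids[r->next_to_use];
	assert(r->tx_info[req_id] == NULL);	/* the ID must be unused */
	r->tx_info[req_id] = pkt;
	r->next_to_use = (uint16_t)((r->next_to_use + 1) & (RING_SIZE - 1));
	r->in_flight++;
	return req_id;
}

/* Completion side: look the packet up by req_id, release it and return
 * the ID to the free ring. Completions may arrive in any order. */
const void *ring_complete(struct tx_ring_sketch *r, uint16_t req_id)
{
	const void *pkt = r->tx_info[req_id];

	assert(pkt != NULL);			/* the ID must be in flight */
	r->tx_info[req_id] = NULL;
	r->free_ids[r->next_to_clean] = req_id;
	r->next_to_clean = (uint16_t)((r->next_to_clean + 1) & (RING_SIZE - 1));
	r->in_flight--;
	return pkt;
}
```

Because completed IDs are written back in completion order, the order in
which IDs are handed out later can differ from the initial 0, 1, 2, ...
sequence.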

Rx
--

- A packet is received from the ENA device.
- The interrupt handler schedules NAPI.
- The :code:`ena_clean_rx_irq()` function is called. This function calls
  :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
  number of descriptors used for a new packet, and zero if
  no new packet is found.
- :code:`ena_rx_skb()` checks packet length:

  * If the packet is small (len < rx_copybreak), the driver allocates
    an SKB for the new packet, and copies the packet payload into the
    SKB data buffer.

    - In this way the original data buffer is not passed to the stack
      and is reused for future Rx packets.

  * Otherwise the function unmaps the Rx buffer, sets the first
    descriptor as `skb`'s linear part and the other descriptors as the
    `skb`'s frags.

- The new SKB is updated with the necessary information (protocol,
  checksum hw verify result, etc.), and then passed to the network
  stack, using the NAPI interface function :code:`napi_gro_receive()`.

Dynamic RX Buffers (DRB)
------------------------

Each RX descriptor in the RX ring is a single memory page (which is either 4KB
or 16KB long depending on the system's configuration).
To reduce the memory allocations required when dealing with a high rate of small
packets, the driver tries to reuse the remaining RX descriptor's space if more
than 2KB of this page remains unused.

A simple example of this mechanism is the following sequence of events:

::

        1. Driver allocates page-sized RX buffer and passes it to hardware
                +----------------------+
                |4KB RX Buffer         |
                +----------------------+

        2. A 300Bytes packet is received on this buffer

        3. The driver increases the ref count on this page and returns it back to
           HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
               +----+--------------------+
               |****|3796 Bytes RX Buffer|
               +----+--------------------+

This mechanism isn't used when an XDP program is loaded, or when the
RX packet is less than rx_copybreak bytes (in which case the packet is
copied out of the RX buffer into the linear part of a new skb allocated
for it and the RX buffer remains the same size, see `RX copybreak`_).
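
The reuse decision described in this section can be sketched as a small
userspace model. It is a simplification under stated assumptions, not the
driver's implementation: the names ``rx_buf_sketch``,
``rx_buf_try_reuse()`` and ``rx_buf_len()`` are hypothetical, and the
model ignores alignment, headroom, the rx_copybreak path and the XDP
case. It only reproduces the "more than 2KB remaining" rule and the
arithmetic of the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REUSE_THRESHOLD 2048u	/* reuse only if more than 2KB remains */

struct rx_buf_sketch {
	size_t page_size;	/* 4096 or 16384 in the text above */
	size_t offset;		/* first unused byte within the page */
	unsigned int refcount;	/* references held on the page */
};

/* Size of the RX buffer currently exposed to the hardware. */
size_t rx_buf_len(const struct rx_buf_sketch *buf)
{
	return buf->page_size - buf->offset;
}

/* Called after a packet of pkt_len bytes landed at buf->offset.
 * Returns true if the rest of the page is handed back to the hardware
 * as a smaller RX buffer, false if the page is not reused. */
bool rx_buf_try_reuse(struct rx_buf_sketch *buf, size_t pkt_len)
{
	size_t remaining;

	assert(pkt_len <= rx_buf_len(buf));
	remaining = rx_buf_len(buf) - pkt_len;
	if (remaining <= REUSE_THRESHOLD)
		return false;

	buf->offset += pkt_len;	/* skip the bytes the packet occupies */
	buf->refcount++;	/* the page now backs one more buffer */
	return true;
}
```

With a 4KB page and a 300-byte packet, 3796 bytes remain, which exceeds
the 2KB threshold, so the page is reused as a 3796-byte buffer. A later
packet that would leave 2KB or less ends the reuse of that page.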