.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
   NIC driver and firmware replace the kernel networking stack
   with their own TCP handling. It is not usable in production environments
   that make use of the Linux networking stack, for example of any firewalling
   abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on per-connection basis is not currently supported.
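
For orientation, the user space step mentioned above (enabling the ULP and
installing the connection state) can be sketched as follows. This is a
minimal sketch assuming TLS 1.2 with AES-GCM-128 and the ``setsockopt``
interface described in the user-facing TLS documentation;
``ktls_enable_tx()`` is a hypothetical helper name, and the fallback
definitions of ``TCP_ULP`` and ``SOL_TLS`` are only there for older C library
headers. Whether the connection then runs in ``TLS_SW`` or ``TLS_HW`` mode is
decided by the kernel, not by this code.

.. code-block:: c

  #include <linux/tls.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <string.h>
  #include <sys/socket.h>

  #ifndef TCP_ULP
  #define TCP_ULP 31
  #endif
  #ifndef SOL_TLS
  #define SOL_TLS 282
  #endif

  /* Hypothetical helper: enable the TLS ULP and install the TX state. */
  static int ktls_enable_tx(int fd, const unsigned char *key,
                            const unsigned char *salt,
                            const unsigned char *iv,
                            const unsigned char *rec_seq)
  {
      struct tls12_crypto_info_aes_gcm_128 ci;

      /* Attach the "tls" ULP to an established TCP socket. */
      if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
          return -1;

      memset(&ci, 0, sizeof(ci));
      ci.info.version = TLS_1_2_VERSION;
      ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
      memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
      memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
      memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
      memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

      /* Install the state for the TX direction; RX is installed separately. */
      return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
  }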

TX
--

At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to the decryption
routine. If the device decrypted all the segments of the record the decryption
is skipped, otherwise the software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt: TLS offload layers
   :align: center
   :figwidth: 28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload.
In case offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

Offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

  int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                     enum tls_offload_ctx_dir direction,
                     struct tls_crypto_info *crypto_info,
                     u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. Driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction, the local networking stack has little control over
segmentation, so the initial records' TCP sequence number may be anywhere
inside the segment.
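
Because TX segments queued before the connection state was installed may
still pass through the driver, a TX path has to compare each segment's TCP
sequence number against ``start_offload_tcp_sn`` in a way that survives
32-bit wraparound. The following user space sketch illustrates one possible
check. ``seq_before()`` is the usual serial-number comparison;
``classify_tx_segment()`` and ``enum tx_seg_class`` are hypothetical names
introduced for illustration and are not part of the kernel API.

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical classification of a TX segment. */
  enum tx_seg_class {
      TX_SEG_BEFORE_OFFLOAD,  /* queued before the state was installed */
      TX_SEG_OFFLOADED,       /* at or after start_offload_tcp_sn */
  };

  /* Wraparound-safe "a is before b" for 32-bit TCP sequence numbers. */
  static bool seq_before(uint32_t a, uint32_t b)
  {
      return (int32_t)(a - b) < 0;
  }

  static enum tx_seg_class classify_tx_segment(uint32_t seg_seq,
                                               uint32_t start_offload_tcp_sn)
  {
      if (seq_before(seg_seq, start_offload_tcp_sn))
          return TX_SEG_BEFORE_OFFLOAD;
      return TX_SEG_OFFLOADED;
  }

A segment classified as ``TX_SEG_BEFORE_OFFLOAD`` carries data for which the
kernel holds no record information, so in this model it would be sent without
device encryption.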

Normal operation
================

At the minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible, the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

TX
--

The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If the packet is matched to a connection, the device confirms
if the TCP sequence number is the expected one and proceeds to TLS handling
(record delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified, the stack takes care of record
decapsulation. Device indicates successful handling of TLS offload in the
per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. Networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In presence of packet drops or network packet reordering, the device may lose
synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.
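
The expected TCP sequence number and the segment-to-segment state described
under normal operation determine when a device falls out of sync. The
following user space model is a simplified sketch of that tracking: it
delineates records across arbitrarily segmented in-order data, carrying a
partial header between segments, and gives up on the first unexpected
sequence number. ``struct rx_track`` and ``rx_track_segment()`` are
hypothetical names; no cryptography is modelled, and treating every
unexpected segment as a loss of sync is stricter than what a real device
needs to do.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN 5  /* type (1) + version (2) + length (2) */

  /* Hypothetical per-connection, per-direction tracking state. */
  struct rx_track {
      uint32_t expected_seq;     /* next in-order TCP sequence number */
      uint8_t hdr[TLS_HDR_LEN];  /* partial header kept across segments */
      size_t hdr_have;           /* header bytes collected so far */
      size_t rec_left;           /* bytes left in the current record body */
      uint64_t rec_seq;          /* number of records fully seen */
      int in_sync;
  };

  /*
   * Walk one TCP segment. Returns the number of records completed in this
   * segment, or -1 if the segment was not the expected one.
   */
  static int rx_track_segment(struct rx_track *t, uint32_t seq,
                              const uint8_t *data, size_t len)
  {
      uint32_t next_seq = seq + (uint32_t)len;
      int completed = 0;

      if (!t->in_sync || seq != t->expected_seq) {
          t->in_sync = 0;
          return -1;
      }

      while (len) {
          size_t n;

          if (t->rec_left == 0) {
              /* Header phase: the header may be split across segments. */
              t->hdr[t->hdr_have++] = *data++;
              len--;
              if (t->hdr_have < TLS_HDR_LEN)
                  continue;
              t->hdr_have = 0;
              t->rec_left = ((size_t)t->hdr[3] << 8) | t->hdr[4];
              if (t->rec_left == 0) {  /* empty record body */
                  t->rec_seq++;
                  completed++;
              }
              continue;
          }

          /* Body phase: consume as much of the record as is available. */
          n = len < t->rec_left ? len : t->rec_left;
          data += n;
          len -= n;
          t->rec_left -= n;
          if (t->rec_left == 0) {
              t->rec_seq++;
              completed++;
          }
      }

      t->expected_seq = next_seq;
      return completed;
  }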

TX
--

Segments transmitted from an offloaded socket can get out of sync
in ways similar to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move onto handling
the actual packet.

In this mode depending on the implementation the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection, therefore it is the recommended method until
it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete driver should not access its expected TCP
sequence number (as it will be updated from a different context).
Following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. Stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

RX
--

A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt: reorder of non-header segment
   :align: center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of record
spanning segments 1, 2 and 3.
The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt: reorder of header segment
   :align: center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once pattern is matched
the device continues attempting parsing headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device had been parsing
and counting all records since the just-confirmed one, it adds the number
of records it had seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.
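
The scan and the record number arithmetic described above can be sketched in
user space as follows. This is an illustration only, not how any particular
device implements it: ``hdr_plausible()``, ``scan_for_header()`` and
``resume_rec_seq()`` are hypothetical names, the scan works on a flat buffer
rather than on a packet stream, and ``TLS_MAX_REC`` is an assumed upper bound
on the record length field.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN 5
  #define TLS_MAX_REC (16384 + 256)  /* assumed bound on the length field */

  /* Does a plausible TLS 1.2/1.3 record header start at p? */
  static int hdr_plausible(const uint8_t *p)
  {
      size_t len = ((size_t)p[3] << 8) | p[4];

      return p[1] == 0x03 && p[2] == 0x03 && len <= TLS_MAX_REC;
  }

  /*
   * Find the first header candidate whose length field leads to another
   * plausible header. Returns its offset in buf, or -1 if none is found.
   */
  static long scan_for_header(const uint8_t *buf, size_t len)
  {
      size_t off;

      for (off = 0; off + TLS_HDR_LEN <= len; off++) {
          size_t rec_len, next;

          if (!hdr_plausible(buf + off))
              continue;
          rec_len = ((size_t)buf[off + 3] << 8) | buf[off + 4];
          next = off + TLS_HDR_LEN + rec_len;
          if (next + TLS_HDR_LEN <= len && hdr_plausible(buf + next))
              return (long)off;
          /* Expected location is not a valid header: keep scanning. */
      }
      return -1;
  }

  /* Record number to resume at once the kernel confirms the guess. */
  static uint64_t resume_rec_seq(uint64_t confirmed_seq, uint64_t seen_since)
  {
      return confirmed_seq + seen_since;
  }

A match from ``scan_for_header()`` is only a guess; in the scheme described
above decryption does not resume until the kernel has confirmed the location
and supplied the record sequence number.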

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart the scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and a record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted, the stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped.
For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted such packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
had been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering original packet in the
driver if device provides precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors; packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections a device can support can be exposed via
the ``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.

Statistics
==========

Following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

For the purpose of simplifying TLS offload, the device should not modify any
packet headers.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, TLS TX device feature flag requires TX csum offload being set.
Disabling the latter implies clearing the former. Disabling TX checksum offload
should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM.
If the user
does not want to enable RX csum offload, TLS RX device feature is disabled
as well.
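
The dependency between the TLS and checksum feature flags can be modelled
with a small user space sketch. It illustrates the rule only and is not the
kernel's implementation: ``fix_tls_features()`` is a hypothetical name and
the ``F_*`` bit values are made up for this example (the real flags are the
``NETIF_F_*`` features).

.. code-block:: c

  #include <stdint.h>

  /* Hypothetical feature bits for illustration; not the kernel's values. */
  #define F_TX_CSUM    (1u << 0)
  #define F_RX_CSUM    (1u << 1)
  #define F_HW_TLS_TX  (1u << 2)
  #define F_HW_TLS_RX  (1u << 3)

  /* Clear TLS offload features whose checksum prerequisite is missing. */
  static uint32_t fix_tls_features(uint32_t features)
  {
      if (!(features & F_TX_CSUM))
          features &= ~F_HW_TLS_TX;
      if (!(features & F_RX_CSUM))
          features &= ~F_HW_TLS_RX;
      return features;
  }

In this model the result only decides whether new connections may be
offloaded; connections that are already offloaded stay active, as described
above.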