.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in two modes:

 * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).

The operation mode is selected automatically based on device configuration,
offload opt-in or opt-out on per-connection basis is not currently supported.

TX
--

At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to decryption routine.
If device decrypted all the segments of the record the decryption is skipped,
otherwise software path handles decryption.

.. kernel-figure:: tls-offload-layers.svg
   :alt: TLS offload layers
   :align: center
   :figwidth: 28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. In case offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

Offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

  int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                     enum tls_offload_ctx_dir direction,
                     struct tls_crypto_info *crypto_info,
                     u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. Driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction, the local networking stack has little control over
segmentation, so the initial records' TCP sequence number may be anywhere
inside the segment.

Normal operation
================

At the minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible, the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

TX
--

The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.
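
The segment-to-segment state listed earlier in this section (partial headers
and the position within the current record) can be illustrated with a small
user-space sketch. It is not taken from any driver or device: ``struct
rec_tracker`` and ``rec_tracker_feed()`` are hypothetical names, and the only
protocol fact relied upon is the 5-byte TLS record header (content type,
two version bytes, 16-bit big-endian length). Crypto processing is left out
entirely.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN 5  /* content type (1) + version (2) + length (2) */

  /* Hypothetical per-connection, per-direction tracking state. */
  struct rec_tracker {
      uint8_t  hdr[TLS_HDR_LEN]; /* partial header carried between segments */
      size_t   hdr_have;         /* header bytes collected so far */
      size_t   body_left;        /* bytes left in the current record body */
      uint64_t records_done;     /* complete records seen so far */
  };

  /*
   * Feed the payload of one in-order segment.
   * Returns the number of record headers completed within this segment.
   */
  static unsigned int rec_tracker_feed(struct rec_tracker *t,
                                       const uint8_t *data, size_t len)
  {
      unsigned int started = 0;

      while (len) {
          if (t->body_left) {
              /* Inside a record body: consume what this segment holds. */
              size_t n = len < t->body_left ? len : t->body_left;

              data += n;
              len -= n;
              t->body_left -= n;
              if (!t->body_left)
                  t->records_done++;
              continue;
          }

          /* At a record boundary: collect the (possibly split) header. */
          t->hdr[t->hdr_have++] = *data++;
          len--;
          if (t->hdr_have == TLS_HDR_LEN) {
              t->body_left = ((size_t)t->hdr[3] << 8) | t->hdr[4];
              t->hdr_have = 0;
              started++;
              if (!t->body_left)
                  t->records_done++; /* zero-length record */
          }
      }
      return started;
  }

Because the tracker keeps ``hdr_have`` and ``body_left`` between calls, a
header or a record body may be split at any byte across segments, which is
the property the text above asks of the device for in-order traffic.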

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If the packet is matched to a connection, the device confirms
if the TCP sequence number is the expected one and proceeds to TLS handling
(record delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified, the stack takes care of record
decapsulation. Device indicates successful handling of TLS offload in the
per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. Networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In presence of packet drops or network packet reordering, the device may lose
synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.

TX
--

Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side-retransmissions - local drops
are possible, though network reorders are not.
There are currently two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move on to handling
the actual packet.

In this mode depending on the implementation the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection therefore it is the recommended method until
such time as it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete driver should not access its expected TCP
sequence number (as it will be updated from a different context).
Following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. Stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

RX
--

A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when record boundary can be recovered:

.. kernel-figure:: tls-offload-reorder-good.svg
   :alt: reorder of non-header segment
   :align: center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure:: tls-offload-reorder-bad.svg
   :alt: reorder of header segment
   :align: center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once pattern is matched
the device continues attempting parsing headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.

The asynchronous resync process is coordinated on the kernel side using
struct tls_offload_resync_async, which tracks and manages the resync request.

Helper functions to manage struct tls_offload_resync_async:

``tls_offload_rx_resync_async_request_start()``
Initializes an asynchronous resync attempt by specifying the sequence range to
monitor and resetting internal state in the struct.

``tls_offload_rx_resync_async_request_end()``
Retains the device's guessed TCP sequence number for comparison with current or
future logged ones.
It also clears the ``RESYNC_REQ_ASYNC`` flag from the resync
request, indicating that the device has submitted its guessed sequence number.

``tls_offload_rx_resync_async_request_cancel()``
Cancels any in-progress resync attempt, clearing the request state.

When the kernel processes an RX segment that begins a new TLS record, it
examines the current status of the asynchronous resynchronization request.

If the device is still waiting to provide its guessed TCP sequence number
(the async state), the kernel records the sequence number of this segment so
that it can later be compared once the device's guess becomes available.

If the device has already submitted its guessed sequence number (the non-async
state), the kernel now tries to match that guess against the sequence numbers of
all TLS record headers that have been logged since the resync request
started.

The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device had been parsing
and counting all records since the just-confirmed one; it adds the number
of records it had seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart scan. Given how unlikely falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and a record may get processed
by the kernel before the confirmation request.
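
The header scan at the start of this subsection can be sketched in user-space
C as follows. This is an illustration under simplifying assumptions, not a
description of any device: it scans one contiguous buffer rather than a
segmented stream, ``hdr_plausible()`` and ``scan_for_header()`` are
hypothetical names, the ``chain`` parameter (how many consecutive headers must
line up before the guess is trusted) is an invented knob, and the length bound
uses the TLS 1.2 ciphertext limit of 2^14 + 2048 bytes.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN     5
  #define TLS_MAX_REC_LEN (16384 + 2048) /* assumed upper bound on length */

  /* Does a plausible TLS 1.2/1.3 record header start at buf[off]? */
  static int hdr_plausible(const uint8_t *buf, size_t len, size_t off)
  {
      size_t rec_len;

      if (off > len || len - off < TLS_HDR_LEN)
          return 0;
      /* Version field pattern named in the text: 0x03 0x03. */
      if (buf[off + 1] != 0x03 || buf[off + 2] != 0x03)
          return 0;
      rec_len = ((size_t)buf[off + 3] << 8) | buf[off + 4];
      return rec_len <= TLS_MAX_REC_LEN;
  }

  /*
   * Find the first offset at which `chain` consecutive plausible headers
   * appear, each located where the previous header's length field says
   * the next record starts. Returns the offset, or -1 if none is found.
   */
  static long scan_for_header(const uint8_t *buf, size_t len,
                              unsigned int chain)
  {
      for (size_t start = 0; start + TLS_HDR_LEN <= len; start++) {
          size_t off = start;
          unsigned int ok = 0;

          while (ok < chain && hdr_plausible(buf, len, off)) {
              off += TLS_HDR_LEN +
                     (((size_t)buf[off + 3] << 8) | buf[off + 4]);
              ok++;
          }
          if (ok == chain)
              return (long)start;
          /* Mismatch at an expected location: restart one byte later. */
      }
      return -1;
  }

A match found this way is still only a guess; as described above, the device
needs the kernel to confirm the location and supply the record sequence
number before decryption can resume.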

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted such packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised.
In other words either all records in the packet
had been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering original packet in the
driver if device provides precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors, packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions,
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections device can support can be exposed via
``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.

Statistics
==========

Following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

For the purpose of simplifying TLS offload, the device should not modify any
packet headers.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, TLS TX device feature flag requires TX csum offload being set.
Disabling the latter implies clearing the former. Disabling TX checksum offload
should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX csum offload, TLS RX device feature is disabled
as well.
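
The dependency between checksum offload and the TLS feature flags described
above can be summarized as a small fixup function. The following is
a user-space sketch only: the ``F_*`` bits and ``fix_tls_features()`` are
hypothetical stand-ins chosen for illustration, not the kernel's actual
feature flags or its implementation.

.. code-block:: c

  #include <stdint.h>

  /* Hypothetical feature bits standing in for the real device features. */
  #define F_TX_CSUM   (1u << 0)
  #define F_RX_CSUM   (1u << 1)
  #define F_HW_TLS_TX (1u << 2)
  #define F_HW_TLS_RX (1u << 3)

  /*
   * Drop TLS offload features whose checksum prerequisite is missing.
   * This only affects which features are advertised for new connections.
   */
  static uint32_t fix_tls_features(uint32_t features)
  {
      if (!(features & F_TX_CSUM))
          features &= ~F_HW_TLS_TX; /* TLS TX requires TX csum offload */
      if (!(features & F_RX_CSUM))
          features &= ~F_HW_TLS_RX; /* TLS RX requires RX csum offload */
      return features;
  }

For example, requesting ``F_HW_TLS_TX | F_HW_TLS_RX | F_RX_CSUM`` yields
``F_HW_TLS_RX | F_RX_CSUM``: the TX offload bit is cleared because TX
checksum offload is absent, while the RX side is left unchanged.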