.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state, user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context the CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - a mode of operation where
   the NIC driver and firmware replace the kernel networking stack
   with their own TCP handling. It is not usable in production environments
   that make use of the Linux networking stack, for example any firewalling
   abilities, QoS or packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on a per-connection basis is not currently supported.

TX
--

At a high level, user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead, packets reach the device driver, which marks the packets
for crypto offload based on the socket the packet is attached to
and sends them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon a read
request, records are retrieved from the socket and passed to the decryption
routine. If the device decrypted all the segments of the record, the decryption
is skipped; otherwise the software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt:	TLS offload layers
   :align:	center
   :figwidth:	28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. In case offload
fails, the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

The offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info,
			   u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. The driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
the sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

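
The bookkeeping a driver performs in this callback can be modelled in plain C.
The following is a minimal user-space sketch, not driver code:
``dev_table``, ``dev_tls_ctx``, ``dev_tls_add()`` and the table size are
hypothetical names chosen for illustration, and a fixed 16-byte key is assumed.
It shows a context being initialized from the record sequence number and
``start_offload_tcp_sn``, and a full table producing an error so that the
connection stays in software.

.. code-block:: c

	#include <errno.h>
	#include <stdint.h>
	#include <string.h>

	#define DEV_TABLE_SIZE	4	/* arbitrary capacity for this sketch */

	enum dev_dir { DEV_DIR_RX, DEV_DIR_TX };

	/* Hypothetical per-connection, per-direction device context. */
	struct dev_tls_ctx {
		int in_use;
		enum dev_dir dir;
		uint8_t key[16];
		uint64_t rec_no;	/* TLS record sequence number */
		uint32_t expected_seq;	/* TCP sequence number where that record starts */
	};

	struct dev_table {
		struct dev_tls_ctx slot[DEV_TABLE_SIZE];
	};

	/* Returns the slot index on success, -ENOSPC if the table is full. */
	static int dev_tls_add(struct dev_table *t, enum dev_dir dir,
			       const uint8_t key[16], uint64_t rec_no,
			       uint32_t start_offload_tcp_sn)
	{
		int i;

		for (i = 0; i < DEV_TABLE_SIZE; i++) {
			struct dev_tls_ctx *c = &t->slot[i];

			if (c->in_use)
				continue;
			c->in_use = 1;
			c->dir = dir;
			memcpy(c->key, key, sizeof(c->key));
			c->rec_no = rec_no;
			c->expected_seq = start_offload_tcp_sn;
			return i;
		}
		return -ENOSPC;	/* caller keeps the connection in software */
	}

RX and TX contexts occupy separate slots in this model, mirroring the fact
that the two directions are installed independently.
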
TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction, the local networking stack has little control over
segmentation, so the initial record's TCP sequence number may be anywhere
inside the segment.

Normal operation
================

At a minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular,
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible, the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data has been seen but part of the
   authentication tag has to be written to or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

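
To illustrate why only a small amount of segment-to-segment state is needed,
the sketch below delineates records from in-order segments whose boundaries
fall anywhere, including in the middle of a header. It is a user-space model
under stated assumptions: ``rec_parser`` and ``rec_parser_feed()`` are
hypothetical names, the 5-byte TLS record header layout (type, version,
length) is assumed, and no cryptography is performed.

.. code-block:: c

	#include <stddef.h>
	#include <stdint.h>

	#define TLS_HDR_LEN	5	/* type (1) + version (2) + length (2) */

	/* Hypothetical per-connection parser state, not a kernel structure. */
	struct rec_parser {
		uint8_t hdr[TLS_HDR_LEN];	/* partial header kept between segments */
		size_t hdr_have;		/* header bytes collected so far */
		size_t body_left;		/* payload and tag bytes left in the record */
		uint64_t records_done;		/* records fully delineated */
	};

	/* Consume one in-order segment; record boundaries may fall anywhere. */
	static void rec_parser_feed(struct rec_parser *p, const uint8_t *seg,
				    size_t len)
	{
		size_t i = 0;

		while (i < len) {
			if (p->body_left) {
				/* Middle of a record: skip payload and tag. */
				size_t n = len - i < p->body_left ?
					   len - i : p->body_left;

				p->body_left -= n;
				i += n;
				if (!p->body_left)
					p->records_done++;
				continue;
			}
			/* Collect a header, possibly split across segments. */
			p->hdr[p->hdr_have++] = seg[i++];
			if (p->hdr_have == TLS_HDR_LEN) {
				p->body_left = ((size_t)p->hdr[3] << 8) | p->hdr[4];
				p->hdr_have = 0;
				if (!p->body_left)
					p->records_done++;
			}
		}
	}

A real device would additionally carry the partial cipher block and partial
authentication tag across calls; the header bytes and the remaining record
length are the part of that state this model keeps.
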
TX
--

The kernel stack performs record framing, reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after the NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If the packet is matched to a connection, the device checks
whether the TCP sequence number is the expected one and proceeds to TLS handling
(record delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified; the stack takes care of record
decapsulation. The device indicates successful handling of TLS offload in the
per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. The networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

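
The per-record outcome on the host side can be sketched as a three-way
classification over the decrypted marks of the segments that make up one
record. This is an illustrative user-space model only: ``rec_action`` and
``rec_classify()`` are hypothetical names, and how the stack actually resolves
the mixed case is not shown here.

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>

	enum rec_action {
		REC_SKIP_DECRYPT,	/* device decrypted every segment */
		REC_SW_DECRYPT,		/* device decrypted nothing */
		REC_PARTIAL,		/* mix: needs the partial decryption path */
	};

	/* nsegs is assumed to be at least 1. */
	static enum rec_action rec_classify(const bool *seg_decrypted, size_t nsegs)
	{
		size_t i, ndec = 0;

		for (i = 0; i < nsegs; i++)
			ndec += seg_decrypted[i];
		if (ndec == nsegs)
			return REC_SKIP_DECRYPT;
		return ndec ? REC_PARTIAL : REC_SW_DECRYPT;
	}

Keeping decrypted and non-decrypted segments from being coalesced is what
makes a per-segment mark like this meaningful at record granularity.
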
Resync handling
===============

In the presence of packet drops or network packet reordering, the device may
lose synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such a connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.

TX
--

Segments transmitted from an offloaded socket can get out of sync
in ways similar to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move on to handling
the actual packet.

In this mode, depending on the implementation, the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection, therefore it is the recommended method until
such time as it is proven inefficient.

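
How much of the record has to be replayed follows directly from the sequence
numbers. The sketch below computes that prefix length; ``rec_info`` and
``resync_prefix_len()`` are hypothetical names for this illustration, and the
way a driver locates the record covering a segment is assumed to exist
elsewhere.

.. code-block:: c

	#include <stdint.h>

	/* Hypothetical description of the record covering an out of order segment. */
	struct rec_info {
		uint32_t start_seq;	/* TCP sequence number of the record's first byte */
		uint32_t len;		/* total record length, header and tag included */
		uint64_t rec_no;	/* TLS record sequence number */
	};

	/*
	 * Number of record bytes that precede the segment and would have to be
	 * passed to the device along with it. Unsigned subtraction keeps the
	 * result correct across TCP sequence number wraparound. Returns -1 if
	 * the segment does not start inside the record.
	 */
	static int64_t resync_prefix_len(const struct rec_info *rec, uint32_t seg_seq)
	{
		uint32_t off = seg_seq - rec->start_seq;

		return off < rec->len ? (int64_t)off : -1;
	}

A segment that starts exactly at the record boundary yields a prefix of zero
bytes, in which case only the sequence number and record number need to be
supplied.
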
Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop; the driver asks the stack to sync the device
to the next record state and falls back to software.

A resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete the driver should not access its expected TCP
sequence number (as it will be updated from a different context).
The following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

The next time ``ktls`` pushes a record it will first send its TCP sequence
number and TLS record number to the driver. The stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

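
The three cases a driver distinguishes in this mode can be sketched with a
wraparound-safe sequence number comparison. This is a user-space illustration:
``tx_seg_class`` and ``tx_classify()`` are hypothetical names, and the calls
into the software fallback and the resync request are left out.

.. code-block:: c

	#include <stdint.h>

	enum tx_seg_class {
		TX_SEG_IN_ORDER,	/* matches the expected sequence number: offload */
		TX_SEG_RETRANSMIT,	/* below expected: software fallback, keep state */
		TX_SEG_FUTURE,		/* above expected: fallback and request a resync */
	};

	static enum tx_seg_class tx_classify(uint32_t seg_seq, uint32_t expected_seq)
	{
		/*
		 * The signed difference handles 32-bit wraparound; it relies on
		 * two's complement conversion, as is usual for TCP sequence math.
		 */
		int32_t delta = (int32_t)(seg_seq - expected_seq);

		if (delta == 0)
			return TX_SEG_IN_ORDER;
		return delta < 0 ? TX_SEG_RETRANSMIT : TX_SEG_FUTURE;
	}

Only the ``TX_SEG_FUTURE`` case would lead to a resync request in this model;
a retransmission leaves the expected sequence number untouched.
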
RX
--

A small number of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when the record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt:	reorder of non-header segment
   :align:	center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In the above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of the expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of the record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt:	reorder of header segment
   :align:	center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
The device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example,
for TLS 1.2 and TLS 1.3 two consecutive bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once the pattern is matched
the device continues attempting to parse headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

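
A scan of this kind can be sketched in user-space C as follows. This is a
model under stated assumptions rather than a description of any device:
``hdr_plausible()`` and ``scan_for_header()`` are hypothetical names, content
types 20 to 23 are the only ones accepted, the length bound assumes the
TLS 1.2 limit of 2^14 + 2048 bytes, and the whole stream is assumed to be
available in one buffer.

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	#define TLS_HDR_LEN		5
	#define TLS_MAX_REC_BODY	(16384 + 2048)	/* assumed length bound */

	static size_t hdr_body_len(const uint8_t *p)
	{
		return ((size_t)p[3] << 8) | p[4];
	}

	static bool hdr_plausible(const uint8_t *p)
	{
		size_t len = hdr_body_len(p);

		return p[0] >= 20 && p[0] <= 23 &&	/* known content types */
		       p[1] == 0x03 && p[2] == 0x03 &&	/* version field */
		       len > 0 && len <= TLS_MAX_REC_BODY;
	}

	/*
	 * Return the offset of the first candidate header that is followed by
	 * `chain` further plausible headers at the locations implied by the
	 * length fields, or -1 if the buffer contains no such candidate.
	 */
	static long scan_for_header(const uint8_t *buf, size_t n, unsigned int chain)
	{
		size_t start;

		for (start = 0; start + TLS_HDR_LEN <= n; start++) {
			size_t pos = start;
			unsigned int seen = 0;

			while (pos + TLS_HDR_LEN <= n && hdr_plausible(buf + pos)) {
				if (seen++ == chain)
					return (long)start;
				pos += TLS_HDR_LEN + hdr_body_len(buf + pos);
			}
			/* Chain broke or ran out of data: restart one byte later. */
		}
		return -1;
	}

Requiring a longer chain lowers the chance of latching onto payload bytes that
happen to look like a header, at the cost of a later match.
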
When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.

The asynchronous resync process is coordinated on the kernel side using
``struct tls_offload_resync_async``, which tracks and manages the resync request.

Helper functions to manage ``struct tls_offload_resync_async``:

``tls_offload_rx_resync_async_request_start()``
  Initializes an asynchronous resync attempt by specifying the sequence range to
  monitor and resetting internal state in the struct.

``tls_offload_rx_resync_async_request_end()``
  Retains the device's guessed TCP sequence number for comparison with current or
  future logged ones. It also clears the ``RESYNC_REQ_ASYNC`` flag from the resync
  request, indicating that the device has submitted its guessed sequence number.

``tls_offload_rx_resync_async_request_cancel()``
  Cancels any in-progress resync attempt, clearing the request state.

When the kernel processes an RX segment that begins a new TLS record, it
examines the current status of the asynchronous resynchronization request.

If the device is still waiting to provide its guessed TCP sequence number
(the async state), the kernel records the sequence number of this segment so
that it can later be compared once the device's guess becomes available.

If the device has already submitted its guessed sequence number (the non-async
state), the kernel now tries to match that guess against the sequence numbers of
all TLS record headers that have been logged since the resync request
started.

The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device has been parsing
and counting all records since the just-confirmed one; it adds the number
of records it has seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at the next
segment boundary.

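
The matching logic described above can be modelled in a few lines. The sketch
below is a simplified user-space illustration and does not reproduce the
kernel's ``struct tls_offload_resync_async``: ``resync_model``, its fields,
the log size and both functions are hypothetical, and locking and atomics are
ignored.

.. code-block:: c

	#include <stdbool.h>
	#include <stdint.h>

	#define RESYNC_LOG_MAX	13	/* arbitrary size for this sketch */

	/* Hypothetical model of the kernel-side bookkeeping. */
	struct resync_model {
		bool waiting_for_guess;		/* device has not submitted a guess yet */
		uint32_t guess_seq;		/* device's guessed record start */
		uint32_t log[RESYNC_LOG_MAX];	/* logged record start sequence numbers */
		unsigned int loglen;
		unsigned int rcd_delta;		/* records seen since the request started */
	};

	/*
	 * Called for each RX record start with its TCP sequence number and TLS
	 * record number. Returns true once the guess is confirmed and stores
	 * the record number of the guessed record in *rec_no.
	 */
	static bool resync_on_record(struct resync_model *m, uint32_t seq,
				     uint64_t cur_rec_no, uint64_t *rec_no)
	{
		unsigned int i;

		if (m->waiting_for_guess) {
			/* Async state: remember where this record started. */
			if (m->loglen < RESYNC_LOG_MAX)
				m->log[m->loglen++] = seq;
			m->rcd_delta++;
			return false;
		}
		/* Guess available: try the logged record starts first. */
		for (i = 0; i < m->loglen; i++) {
			if (m->log[i] == m->guess_seq) {
				*rec_no = cur_rec_no - (m->rcd_delta - i);
				return true;
			}
		}
		/* Then the record being processed right now. */
		m->loglen = 0;
		if (seq == m->guess_seq) {
			*rec_no = cur_rec_no;
			return true;
		}
		return false;
	}

	/* Device side: add the records counted since the confirmed header. */
	static uint64_t device_resume_rec_no(uint64_t confirmed_rec_no,
					     uint64_t seen_since)
	{
		return confirmed_rec_no + seen_since;
	}

Counting every record in ``rcd_delta``, even when the log is full, keeps the
record number arithmetic valid for the entries that were logged.
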
In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart the scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream, as records may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted, the stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

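
The back off schedule can be expressed as a small counter. The following is an
illustrative user-space sketch: ``resync_backoff`` and its helpers are
hypothetical names, and resetting the schedule once a record arrives decrypted
is an assumption of this model.

.. code-block:: c

	#define RESYNC_MIN_RECORDS	2
	#define RESYNC_MAX_RECORDS	128

	/* Hypothetical counter state for one RX connection. */
	struct resync_backoff {
		unsigned int threshold;	/* encrypted records needed before next attempt */
		unsigned int count;	/* encrypted records seen since the last attempt */
	};

	static void backoff_init(struct resync_backoff *b)
	{
		b->threshold = RESYNC_MIN_RECORDS;
		b->count = 0;
	}

	/* Account for one record; returns 1 if a resync should be scheduled now. */
	static int backoff_record(struct resync_backoff *b, int was_encrypted)
	{
		if (!was_encrypted) {
			/* Assumed: the device is decrypting again, start over. */
			backoff_init(b);
			return 0;
		}
		if (++b->count < b->threshold)
			return 0;
		b->count = 0;
		if (b->threshold < RESYNC_MAX_RECORDS)
			b->threshold *= 2;	/* 2, 4, 8, ... capped at 128 */
		return 1;
	}

With a stream that stays encrypted, this model schedules attempts after the
2nd, 6th and 14th record, and then keeps doubling the gap until it reaches
128 records.
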
Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such a condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example, if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted, such a packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example, authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words, either all records in the packet
have been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering the original packet in the
driver is sufficient if the device provides a precise error).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors; packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

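
The all-or-nothing rule can be summarized as a single predicate over the
per-packet completion information. This is a minimal sketch with hypothetical
names (``rx_tls_status``, ``rx_may_mark_decrypted()``); the fields a real
device reports in its descriptor will differ.

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>

	/* Hypothetical per-packet completion info reported by a device. */
	struct rx_tls_status {
		bool csum_ok;		/* checksums validated */
		size_t nrecords;	/* records (or record parts) in the packet */
		size_t nrecords_ok;	/* of those, handled and authenticated */
	};

	/* Decide whether the driver may set the decrypted mark for the packet. */
	static bool rx_may_mark_decrypted(const struct rx_tls_status *st)
	{
		if (!st->csum_ok)
			return false;	/* bad checksum: no TLS handling at all */
		/* All or nothing: any failure means the wire packet is delivered. */
		return st->nrecords > 0 && st->nrecords_ok == st->nrecords;
	}

Whenever the predicate is false the packet is expected to reach the stack
exactly as it was received on the wire.
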
Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections a device can support can be exposed via
the ``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
a significant performance impact on non-offloaded streams.

Statistics
==========

The following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to an unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream and were dropped, because they arrived out of order and
   the associated record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream and were dropped, because they contain both data that has been
   encrypted by software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. The current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

For the purpose of simplifying TLS offload, the device should not modify any
packet headers.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads; old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, the TLS TX device feature flag requires TX checksum offload to
be set. Disabling the latter implies clearing the former. Disabling TX checksum
offload should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX checksum offload, the TLS RX device feature is
disabled as well.
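
The dependency between the TLS and checksum feature flags can be sketched as
a simple fix-up step. The bit values and names below (``F_TX_CSUM``,
``F_HW_TLS_TX`` and so on, ``fix_tls_features()``) are hypothetical
placeholders for illustration and are not the kernel's ``NETIF_F_*``
definitions.

.. code-block:: c

	#include <stdint.h>

	/* Illustrative bit values only; the real flags are defined by the kernel. */
	#define F_TX_CSUM	(1u << 0)
	#define F_RX_CSUM	(1u << 1)
	#define F_HW_TLS_TX	(1u << 2)
	#define F_HW_TLS_RX	(1u << 3)

	/* Clear each TLS offload flag whose checksum prerequisite is missing. */
	static uint32_t fix_tls_features(uint32_t features)
	{
		if (!(features & F_TX_CSUM))
			features &= ~F_HW_TLS_TX;
		if (!(features & F_RX_CSUM))
			features &= ~F_HW_TLS_RX;
		return features;
	}

The fix-up only affects which offloads may be added from that point on;
connections that are already offloaded are expected to keep working.
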