.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in two modes:

 * Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context the CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on a per-connection basis is not currently supported.

TX
--

At a high level, user write requests are turned into a scatter list. The TLS
ULP intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead, packets reach the device driver, which marks them
for crypto offload based on the socket each packet is attached to
and sends them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon a read
request, records are retrieved from the socket and passed to the decryption
routine. If the device decrypted all the segments of the record, decryption
is skipped; otherwise the software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt:	TLS offload layers
   :align:	center
   :figwidth:	28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. If the offload
fails, the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

The offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info,
			   u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. The driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
the sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

When the offloaded connection is destroyed the core calls
the :c:member:`tls_dev_del` callback so the driver can release per-direction
state:

.. code-block:: c

	void (*tls_dev_del)(struct net_device *netdev,
			    struct tls_context *ctx,
			    enum tls_offload_ctx_dir direction);

``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
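
The add and delete contract can be modeled outside the kernel. The following
is a minimal userspace sketch, not driver code: ``hw_tls_table``,
``hw_tls_add()`` and ``hw_tls_del()`` are hypothetical names and the capacity
of 4 is arbitrary. It only illustrates that an add can fail (here, when the
table is full), which is the case where the connection stays in software:

.. code-block:: c

	#include <errno.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define HW_TLS_MAX_CONNS 4	/* arbitrary capacity for the sketch */

	enum hw_tls_dir { HW_TLS_DIR_RX, HW_TLS_DIR_TX };

	struct hw_tls_slot {
		bool in_use;
		enum hw_tls_dir dir;
		uint32_t expected_tcp_sn;	/* next TCP sequence number to process */
		uint64_t rcd_sn;		/* TLS record sequence number */
	};

	struct hw_tls_table {
		struct hw_tls_slot slots[HW_TLS_MAX_CONNS];
	};

	/* Returns a slot index on success or -ENOSPC when the table is full. */
	int hw_tls_add(struct hw_tls_table *t, enum hw_tls_dir dir,
		       uint32_t start_offload_tcp_sn, uint64_t rcd_sn)
	{
		for (int i = 0; i < HW_TLS_MAX_CONNS; i++) {
			if (t->slots[i].in_use)
				continue;
			t->slots[i] = (struct hw_tls_slot){
				.in_use = true,
				.dir = dir,
				.expected_tcp_sn = start_offload_tcp_sn,
				.rcd_sn = rcd_sn,
			};
			return i;
		}
		return -ENOSPC;
	}

	/* Releases the per-direction state held in one slot. */
	void hw_tls_del(struct hw_tls_table *t, int idx)
	{
		if (idx >= 0 && idx < HW_TLS_MAX_CONNS)
			t->slots[idx].in_use = false;
	}

A real driver would do the equivalent work inside its ``tls_dev_add`` and
``tls_dev_del`` callbacks and program the device accordingly.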

The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
to synchronize the TCP stream with the record boundaries:

.. code-block:: c

	int (*tls_dev_resync)(struct net_device *netdev,
			      struct sock *sk, u32 seq, u8 *rcd_sn,
			      enum tls_offload_ctx_dir direction);

See the `Resync handling`_ section for details.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments which
pass through the driver and belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction, the local networking stack has little control over
segmentation, so the initial record's TCP sequence number may be anywhere
inside the segment.

Normal operation
================

At a minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible, the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.
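
The segment-to-segment state can be illustrated with a small userspace model.
This is a sketch under simplifying assumptions, not device logic:
``rec_tracker`` and ``rec_tracker_feed()`` are hypothetical names, and only
record delineation is modeled (no crypto and no authentication tag handling).
It relies on the 5-byte TLS record header layout (type, version, 16-bit
big-endian length):

.. code-block:: c

	#include <stddef.h>
	#include <stdint.h>

	#define TLS_HDR_LEN 5	/* type (1) + version (2) + length (2) */

	struct rec_tracker {
		uint8_t hdr[TLS_HDR_LEN];	/* partially received header */
		size_t hdr_fill;		/* header bytes collected so far */
		size_t body_left;		/* payload + tag bytes left in record */
		unsigned int records_started;	/* complete headers seen */
	};

	/* Feed one in-order segment; returns the number of headers completed in it. */
	unsigned int rec_tracker_feed(struct rec_tracker *t,
				      const uint8_t *seg, size_t len)
	{
		unsigned int completed = 0;
		size_t i = 0;

		while (i < len) {
			if (t->body_left) {
				/* Skip over the rest of the current record. */
				size_t n = len - i < t->body_left ? len - i : t->body_left;

				t->body_left -= n;
				i += n;
				continue;
			}
			/* Collect header bytes, possibly across segments. */
			t->hdr[t->hdr_fill++] = seg[i++];
			if (t->hdr_fill == TLS_HDR_LEN) {
				/* The length field is big-endian, bytes 3 and 4. */
				t->body_left = ((size_t)t->hdr[3] << 8) | t->hdr[4];
				t->hdr_fill = 0;
				t->records_started++;
				completed++;
			}
		}
		return completed;
	}

Because the partial header and the remaining record length survive between
calls, the tracker makes progress no matter where segment boundaries fall.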

TX
--

The kernel stack performs record framing, reserving space for the
authentication tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after the NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If the packet is matched to a connection, the device confirms
that the TCP sequence number is the expected one and proceeds to TLS handling
(record delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified; the stack takes care of record
decapsulation. The device indicates successful handling of TLS offload in the
per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. The networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In the presence of packet drops or network packet reordering, the device may
lose synchronization with the TLS stream, and require a resync with the
kernel's TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such a connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.

TX
--

Segments transmitted from an offloaded socket can get out of sync
in ways similar to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This most likely means that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move on to handling
the actual packet.

In this mode, depending on the implementation, the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state, assuming that the out of order segment
was just a retransmission. The former is simpler and does not require
retransmission detection, therefore it is the recommended method until
such time as it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop; the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete the driver should not access its expected TCP
sequence number (as it will be updated from a different context).
The following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver via the ``tls_dev_resync`` callback.
The stack will also make sure that the new record will start on a segment
boundary (like it does when the connection is initially added).
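
The driver-side decision in the next record sync mode could be sketched as
below. This is an illustrative userspace model: ``tx_classify()`` and the
``tx_action`` values are hypothetical, and a real driver would call
``tls_offload_tx_resync_request()`` and use the ``ktls`` software fallback
where the sketch merely returns a code:

.. code-block:: c

	#include <stdint.h>

	enum tx_action {
		TX_OFFLOAD,		/* in order: let the device encrypt */
		TX_FALLBACK,		/* retransmission: software encrypts, state kept */
		TX_FALLBACK_RESYNC,	/* gap: software encrypts, request a resync */
	};

	/* Wraparound-safe "a is before b" for 32-bit TCP sequence numbers. */
	static int seq_before(uint32_t a, uint32_t b)
	{
		return (int32_t)(a - b) < 0;
	}

	enum tx_action tx_classify(uint32_t seg_seq, uint32_t expected_seq)
	{
		if (seg_seq == expected_seq)
			return TX_OFFLOAD;
		if (seq_before(seg_seq, expected_seq))
			return TX_FALLBACK;
		return TX_FALLBACK_RESYNC;
	}

The signed-difference comparison keeps the classification correct when the
32-bit TCP sequence space wraps around.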

RX
--

A small number of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when the record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt:	reorder of non-header segment
   :align:	center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In the above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of the expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of the record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt:	reorder of header segment
   :align:	center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
The device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once the pattern is matched
the device keeps attempting to parse headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.
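
The scan can be illustrated with a userspace sketch. It is not a description
of any particular device: ``scan_for_record()``, ``hdr_plausible()`` and the
``chain`` parameter (how many consecutive headers must line up before a guess
is accepted) are assumptions made for the example. The sketch treats content
types 20 to 23 and version bytes ``0x03 0x03`` as plausible:

.. code-block:: c

	#include <stddef.h>
	#include <stdint.h>

	#define TLS_HDR_LEN	5
	#define TLS_MAX_BODY	(16384 + 2048)	/* TLS 1.2 ciphertext length limit */

	/* A plausible header: known content type, version 0x0303, sane length. */
	static int hdr_plausible(const uint8_t *p)
	{
		size_t body = ((size_t)p[3] << 8) | p[4];

		return p[0] >= 20 && p[0] <= 23 &&
		       p[1] == 0x03 && p[2] == 0x03 && body <= TLS_MAX_BODY;
	}

	/*
	 * Returns the offset of the first position where `chain` consecutive
	 * plausible headers are found, or -1 if there is none in the buffer.
	 */
	long scan_for_record(const uint8_t *buf, size_t len, unsigned int chain)
	{
		for (size_t start = 0; start + TLS_HDR_LEN <= len; start++) {
			size_t pos = start;
			unsigned int matched = 0;

			/* Follow the length fields from one guessed header to the next. */
			while (matched < chain && pos + TLS_HDR_LEN <= len &&
			       hdr_plausible(buf + pos)) {
				pos += TLS_HDR_LEN +
				       (((size_t)buf[pos + 3] << 8) | buf[pos + 4]);
				matched++;
			}
			if (matched == chain)
				return (long)start;
		}
		return -1;
	}

Requiring a chain of headers, rather than a single ``0x03 0x03`` match, is
what makes a false positive on payload bytes unlikely.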

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.

The asynchronous resync process is coordinated on the kernel side using
``struct tls_offload_resync_async``, which tracks and manages the resync
request.

Helper functions to manage ``struct tls_offload_resync_async``:

``tls_offload_rx_resync_async_request_start()``
  Initializes an asynchronous resync attempt by specifying the sequence range
  to monitor and resetting internal state in the struct.

``tls_offload_rx_resync_async_request_end()``
  Retains the device's guessed TCP sequence number for comparison with current
  or future logged ones. It also clears the ``RESYNC_REQ_ASYNC`` flag from the
  resync request, indicating that the device has submitted its guessed
  sequence number.

``tls_offload_rx_resync_async_request_cancel()``
  Cancels any in-progress resync attempt, clearing the request state.

When the kernel processes an RX segment that begins a new TLS record, it
examines the current status of the asynchronous resynchronization request.

If the device is still waiting to provide its guessed TCP sequence number
(the async state), the kernel records the sequence number of this segment so
that it can later be compared once the device's guess becomes available.

If the device has already submitted its guessed sequence number (the non-async
state), the kernel tries to match that guess against the sequence numbers of
all TLS record headers that have been logged since the resync request
started.

When a match is found, the kernel confirms the guessed location was correct
and tells the device the record sequence number via the ``tls_dev_resync``
callback. Meanwhile, the device has been parsing and counting all records
since the just-confirmed one; it adds the number of records it has seen to
the record number provided by the kernel.
At this point the device is in sync and can resume decryption at the next
segment boundary.
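
The final adjustment is simple arithmetic. A sketch, assuming the record
number is handled as an 8-byte big-endian value (``rcd_sn_advance()`` is
a hypothetical helper, not a kernel function):

.. code-block:: c

	#include <stdint.h>

	#define TLS_RCD_SN_LEN 8	/* record sequence number is 64 bits wide */

	/* Add `seen` to a big-endian record sequence number, in place. */
	void rcd_sn_advance(uint8_t rcd_sn[TLS_RCD_SN_LEN], uint64_t seen)
	{
		uint64_t sn = 0;

		/* Decode the big-endian bytes into a host integer. */
		for (int i = 0; i < TLS_RCD_SN_LEN; i++)
			sn = (sn << 8) | rcd_sn[i];
		sn += seen;
		/* Encode the result back, least significant byte last. */
		for (int i = TLS_RCD_SN_LEN - 1; i >= 0; i--) {
			rcd_sn[i] = (uint8_t)(sn & 0xff);
			sn >>= 8;
		}
	}

For example, if the kernel confirms record number ``0xfe`` and the device has
counted 3 records since then, decryption resumes with record number ``0x101``.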

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart the scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream, as the record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number via the
``tls_dev_resync`` callback. If the
records continue to be received fully encrypted the stack retries the
synchronization with an exponential backoff (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).
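
The retry schedule described above can be written as a small function.
``resync_retry_threshold()`` is a hypothetical name for this sketch; the
kernel's actual bookkeeping may be organized differently:

.. code-block:: c

	/*
	 * Number of fully encrypted records to wait for before resync attempt
	 * number `attempt` (0-based): 2, 4, 8, ... capped at 128.
	 */
	unsigned int resync_retry_threshold(unsigned int attempt)
	{
		unsigned int n = 2;

		while (attempt-- && n < 128)
			n *= 2;
		return n;
	}

The cap bounds the cost of a connection that never resynchronizes to one
attempt per 128 records.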

Rekey
=====

Offload does not currently support TLS 1.3, therefore key rotation
is not a concern for offloaded connections at this point.

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such a condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example, if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted, such a packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example, authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words, either all records in the packet
have been handled successfully and authenticated, or the packet has to be
passed to the host's stack as it was on the wire (recovering the original
packet in the driver is sufficient if the device provides a precise error).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors; packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections a device can support can be exposed via
the ``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
a significant performance impact on non-offloaded streams.

Statistics
==========

The following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. The current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

For the purpose of simplifying TLS offload, the device should not modify any
packet headers.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads; old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, the TLS TX device feature flag requires TX csum offload to be
set. Disabling the latter implies clearing the former. Disabling TX checksum
offload should not affect old connections, and drivers should make sure
checksum calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX csum offload, the TLS RX device feature is disabled
as well.
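
The dependency between the feature flags can be sketched as follows. The flag
values and ``fix_tls_features()`` are hypothetical stand-ins for the real
``NETIF_F_*`` bits and the core feature fixup logic:

.. code-block:: c

	#include <stdint.h>

	/* Hypothetical feature bits for this sketch, not the kernel's values. */
	#define F_TX_CSUM	(1u << 0)
	#define F_RX_CSUM	(1u << 1)
	#define F_TLS_TX	(1u << 2)
	#define F_TLS_RX	(1u << 3)

	/* Drop TLS offload flags whose checksum offload dependency is missing. */
	uint32_t fix_tls_features(uint32_t features)
	{
		if (!(features & F_TX_CSUM))
			features &= ~F_TLS_TX;
		if (!(features & F_RX_CSUM))
			features &= ~F_TLS_RX;
		return features;
	}

Only the flags for new offloads change here; connections that are already
offloaded are expected to keep working, as described above.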