.. _kernel_tls:

==========
Kernel TLS
==========

Overview
========

Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs over
TCP. TLS provides end-to-end data integrity and confidentiality.

User interface
==============

Creating a TLS connection
-------------------------

First create a new TCP socket and, once the connection is established, set the
TLS ULP.

.. code-block:: c

  sock = socket(AF_INET, SOCK_STREAM, 0);
  connect(sock, addr, addrlen);
  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

Setting the TLS ULP allows us to set/get TLS socket options. Currently
only the symmetric encryption is handled in the kernel. After the TLS
handshake is complete, we have all the parameters required to move the
data-path to the kernel. There is a separate socket option for moving
the transmit and the receive into the kernel.

.. code-block:: c

  /* From linux/tls.h */
  struct tls_crypto_info {
          unsigned short version;
          unsigned short cipher_type;
  };

  struct tls12_crypto_info_aes_gcm_128 {
          struct tls_crypto_info info;
          unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
          unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
          unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
          unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
  };


  struct tls12_crypto_info_aes_gcm_128 crypto_info;

  crypto_info.info.version = TLS_1_2_VERSION;
  crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
  memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE);
  memcpy(crypto_info.rec_seq, seq_number_write,
         TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
  memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE);

  setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));

Transmit and receive are set separately, but the setup is the same, using either
TLS_TX or TLS_RX.

Sending TLS application data
----------------------------

After setting the TLS_TX socket option all application data sent over this
socket is encrypted using TLS and the parameters provided in the socket option.
For example, we can send an encrypted hello world record as follows:

.. code-block:: c

  const char *msg = "hello world\n";
  send(sock, msg, strlen(msg), 0);

send() data is directly encrypted from the userspace buffer provided
to the encrypted kernel send buffer if possible.

The sendfile system call will send the file's data over TLS records of maximum
length (2^14).

.. code-block:: c

  file = open(filename, O_RDONLY);
  fstat(file, &stat);
  sendfile(sock, file, &offset, stat.st_size);

TLS records are created and sent after each send() call, unless
MSG_MORE is passed. MSG_MORE will delay creation of a record until
MSG_MORE is not passed, or the maximum record size is reached.

The kernel will need to allocate a buffer for the encrypted data.
This buffer is allocated at the time send() is called, such that
either the entire send() call will return -ENOMEM (or block waiting
for memory), or the encryption will always succeed. If send() returns
-ENOMEM and some data was left on the socket buffer from a previous
call using MSG_MORE, the MSG_MORE data is left on the socket buffer.

Receiving TLS application data
------------------------------

After setting the TLS_RX socket option, data returned by all recv family
socket calls is decrypted using the TLS parameters provided. A full TLS
record must be received before decryption can happen.

.. code-block:: c

  char buffer[16384];
  recv(sock, buffer, sizeof(buffer), 0);

Received data is decrypted directly into the user buffer if it is
large enough, and no additional allocations occur. If the userspace
buffer is too small, data is decrypted in the kernel and copied to
userspace.
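
The receive behavior described above only applies once ``TLS_RX`` has been
installed with the peer's write keys. The following is a minimal sketch of
that step, not a definitive implementation: ``fill_aes_gcm_128()`` and
``ktls_enable()`` are hypothetical helper names, and the key material is
assumed to come from a userspace TLS library after a TLS 1.2 AES-GCM-128
handshake. The fallback definition of ``SOL_TLS`` is only there for libc
headers that do not provide it.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282     /* fallback for older libc headers */
#endif

/* Hypothetical helper: copy handshake output into the kernel's layout. */
static void fill_aes_gcm_128(struct tls12_crypto_info_aes_gcm_128 *ci,
                             const unsigned char *key,
                             const unsigned char *iv,
                             const unsigned char *salt,
                             const unsigned char *rec_seq)
{
        memset(ci, 0, sizeof(*ci));
        ci->info.version = TLS_1_2_VERSION;
        ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
}

/*
 * Hypothetical helper: direction is TLS_TX or TLS_RX. The socket is
 * assumed to have the "tls" ULP attached already. Returns 0 on success,
 * -1 with errno set on failure.
 */
static int ktls_enable(int sock, int direction,
                       const struct tls12_crypto_info_aes_gcm_128 *ci)
{
        return setsockopt(sock, SOL_TLS, direction, ci, sizeof(*ci));
}
```

For the receive direction, the structure would be filled from the peer's
write key, IV, salt and sequence number and passed to
``ktls_enable(sock, TLS_RX, &ci)`` before the first recv() call.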

``EINVAL`` is returned if the TLS version in the received message does not
match the version passed in setsockopt.

``EMSGSIZE`` is returned if the received message is too big.

``EBADMSG`` is returned if decryption failed for any other reason.

Sending TLS control messages
----------------------------

Other than application data, TLS has control messages such as alert
messages (record type 21) and handshake messages (record type 22), etc.
These messages can be sent over the socket by providing the TLS record type
via a CMSG. For example the following function sends @data of @length bytes
using a record of type @record_type.

.. code-block:: c

  /* send TLS control message using record_type */
  static int ktls_send_ctrl_message(int sock, unsigned char record_type,
                                    void *data, size_t length)
  {
          struct msghdr msg = {0};
          int cmsg_len = sizeof(record_type);
          struct cmsghdr *cmsg;
          char buf[CMSG_SPACE(cmsg_len)];
          struct iovec msg_iov;   /* Vector of data to send/receive into.  */

          msg.msg_control = buf;
          msg.msg_controllen = sizeof(buf);
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_TLS;
          cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
          cmsg->cmsg_len = CMSG_LEN(cmsg_len);
          *CMSG_DATA(cmsg) = record_type;
          msg.msg_controllen = cmsg->cmsg_len;

          msg_iov.iov_base = data;
          msg_iov.iov_len = length;
          msg.msg_iov = &msg_iov;
          msg.msg_iovlen = 1;

          return sendmsg(sock, &msg, 0);
  }

Control message data should be provided unencrypted, and will be
encrypted by the kernel.

Receiving TLS control messages
------------------------------

TLS control messages are passed in the userspace buffer, with message
type passed via cmsg. If no cmsg buffer is provided, an error is
returned if a control message is received. Data messages may be
received without a cmsg buffer set.

.. code-block:: c

  char buffer[16384];
  char cmsg_buf[CMSG_SPACE(sizeof(unsigned char))];
  struct msghdr msg = {0};
  msg.msg_control = cmsg_buf;
  msg.msg_controllen = sizeof(cmsg_buf);

  struct iovec msg_iov;
  msg_iov.iov_base = buffer;
  msg_iov.iov_len = sizeof(buffer);

  msg.msg_iov = &msg_iov;
  msg.msg_iovlen = 1;

  int ret = recvmsg(sock, &msg, 0 /* flags */);

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg && cmsg->cmsg_level == SOL_TLS &&
      cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
      int record_type = *((unsigned char *)CMSG_DATA(cmsg));
      // Do something with record_type, and control message data in
      // buffer.
      //
      // Note that record_type may be == to application data (23).
  } else {
      // Buffer contains application data.
  }

recv will never return data from mixed types of TLS records.

TLS 1.3 Key Updates
-------------------

In TLS 1.3, KeyUpdate handshake messages signal that the sender is
updating its TX key. Any message sent after a KeyUpdate will be
encrypted using the new key. The userspace library can pass the new
key to the kernel using the TLS_TX and TLS_RX socket options, as for
the initial keys. TLS version and cipher cannot be changed.

To prevent attempting to decrypt incoming records using the wrong key,
decryption will be paused when a KeyUpdate message is received by the
kernel, until the new key has been provided using the TLS_RX socket
option. Any read occurring after the KeyUpdate has been read and
before the new key is provided will fail with EKEYEXPIRED. poll() will
not report any read events from the socket until the new key is
provided. There is no pausing on the transmit side.

Userspace should make sure that the crypto_info provided has been set
properly. In particular, the kernel will not check for key/nonce
reuse.
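
The receive-side sequence described above can be sketched as follows. This
is an illustrative outline under stated assumptions, not code from the
kernel tree: ``is_key_update()`` and ``ktls_rekey_rx()`` are hypothetical
names, the handshake message type 24 for KeyUpdate is taken from RFC 8446,
and the new key material is assumed to have been derived by the userspace
TLS library. As a simplification, only the first handshake message in the
record is inspected.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282     /* fallback for older libc headers */
#endif

#define REC_TYPE_HANDSHAKE 22   /* record type reported via cmsg */
#define HS_TYPE_KEY_UPDATE 24   /* handshake message type, RFC 8446 */

/*
 * Hypothetical helper: return 1 if a record returned by recvmsg() is a
 * handshake record whose first message is a KeyUpdate, 0 otherwise.
 */
static int is_key_update(unsigned char record_type,
                         const unsigned char *data, size_t len)
{
        return record_type == REC_TYPE_HANDSHAKE &&
               len >= 1 && data[0] == HS_TYPE_KEY_UPDATE;
}

/*
 * Hypothetical helper: hand the next RX key to the kernel. new_ci must
 * carry the same TLS version and cipher as the initial TLS_RX setup.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int ktls_rekey_rx(int sock,
                         const struct tls12_crypto_info_aes_gcm_128 *new_ci)
{
        return setsockopt(sock, SOL_TLS, TLS_RX, new_ci, sizeof(*new_ci));
}
```

A caller would check each record returned with a cmsg using
``is_key_update()`` and, when it matches, call ``ktls_rekey_rx()`` before
reading again, since reads in between fail with EKEYEXPIRED.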

The number of successful and failed key updates is tracked in the
``TlsTxRekeyOk``, ``TlsRxRekeyOk``, ``TlsTxRekeyError``,
``TlsRxRekeyError`` statistics. The ``TlsRxRekeyReceived`` statistic
counts KeyUpdate handshake messages that have been received.

Integrating into a userspace TLS library
----------------------------------------

At a high level, the kernel TLS ULP is a replacement for the record
layer of a userspace TLS library.

A patchset to OpenSSL to use ktls as the record layer is
`here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_.

`An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_
of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.

Optional optimizations
----------------------

There are certain condition-specific optimizations the TLS ULP can make,
if requested. Those optimizations are either not universally beneficial
or may impact correctness, hence they require an opt-in.
All options are set per-socket using setsockopt(), and their
state can be checked using getsockopt() and via socket diag (``ss``).

TLS_TX_ZEROCOPY_RO
~~~~~~~~~~~~~~~~~~

For device offload only. Allow sendfile() data to be transmitted directly
to the NIC without making an in-kernel copy. This allows true zero-copy
behavior when device offload is enabled.

The application must make sure that the data is not modified between being
submitted and transmission completing. In other words this is mostly
applicable if the data sent on a socket via sendfile() is read-only.

Modifying the data may result in different versions of the data being used
for the original TCP transmission and TCP retransmissions. To the receiver
this will look like TLS records had been tampered with and will result
in record authentication failures.
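
Opting in to this optimization could look like the sketch below.
``ktls_set_zerocopy_ro()`` and ``ktls_get_zerocopy_ro()`` are hypothetical
wrappers, the option value is assumed to be an ``unsigned int`` flag
(0 or 1), and the fallback definitions are only there for headers that
predate the option.

```c
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282             /* fallback for older libc headers */
#endif
#ifndef TLS_TX_ZEROCOPY_RO
#define TLS_TX_ZEROCOPY_RO 3    /* fallback for older kernel headers */
#endif

/*
 * Hypothetical wrapper: enable (1) or disable (0) the opt-in on a socket
 * that already has TLS_TX installed. Returns 0 on success, -1 with errno
 * set on failure.
 */
static int ktls_set_zerocopy_ro(int sock, unsigned int enable)
{
        return setsockopt(sock, SOL_TLS, TLS_TX_ZEROCOPY_RO,
                          &enable, sizeof(enable));
}

/* Hypothetical wrapper: read the current state back into *enabled. */
static int ktls_get_zerocopy_ro(int sock, unsigned int *enabled)
{
        socklen_t len = sizeof(*enabled);

        return getsockopt(sock, SOL_TLS, TLS_TX_ZEROCOPY_RO, enabled, &len);
}
```

An application serving read-only files would call
``ktls_set_zerocopy_ro(sock, 1)`` once after the TX keys are installed and
before the first sendfile() call.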

TLS_RX_EXPECT_NO_PAD
~~~~~~~~~~~~~~~~~~~~

TLS 1.3 only. Expect the sender to not pad records. This allows the data
to be decrypted directly into user space buffers with TLS 1.3.

This optimization is safe to enable only if the remote end is trusted,
otherwise it is an attack vector for doubling the TLS processing cost.

If the record decrypted turns out to have been padded or is not a data
record it will be decrypted again into a kernel buffer without zero copy.
Such events are counted in the ``TlsDecryptRetry`` statistic.

Statistics
==========

The TLS implementation exposes the following per-namespace statistics
(``/proc/net/tls_stat``):

- ``TlsCurrTxSw``, ``TlsCurrRxSw`` -
  number of TX and RX sessions currently installed where host handles
  cryptography

- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` -
  number of TX and RX sessions currently installed where NIC handles
  cryptography

- ``TlsTxSw``, ``TlsRxSw`` -
  number of TX and RX sessions opened with host cryptography

- ``TlsTxDevice``, ``TlsRxDevice`` -
  number of TX and RX sessions opened with NIC cryptography

- ``TlsDecryptError`` -
  record decryption failed (e.g. due to incorrect authentication tag)

- ``TlsDeviceRxResync`` -
  number of RX resyncs sent to NICs handling cryptography

- ``TlsDecryptRetry`` -
  number of RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
  also increment for non-data records.

- ``TlsRxNoPadViolation`` -
  number of data RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.

- ``TlsTxRekeyOk``, ``TlsRxRekeyOk`` -
  number of successful rekeys on existing sessions for TX and RX

- ``TlsTxRekeyError``, ``TlsRxRekeyError`` -
  number of failed rekeys on existing sessions for TX and RX

- ``TlsRxRekeyReceived`` -
  number of received KeyUpdate handshake messages, requiring userspace
  to provide a new RX key
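
These counters can be read with ordinary file I/O. The sketch below assumes
that each line of ``/proc/net/tls_stat`` holds a counter name followed by
whitespace and an unsigned decimal value; ``tls_stat_lookup()`` is a
hypothetical helper, not an existing API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan "name value" lines from f and store the value
 * of the counter called name in *value. Returns 0 if the counter was
 * found, -1 otherwise. Scanning stops at the first line that does not
 * match the assumed format.
 */
static int tls_stat_lookup(FILE *f, const char *name, unsigned long *value)
{
        char key[64];
        unsigned long v;

        while (fscanf(f, "%63s %lu", key, &v) == 2) {
                if (strcmp(key, name) == 0) {
                        *value = v;
                        return 0;
                }
        }
        return -1;
}
```

A monitoring tool could open ``/proc/net/tls_stat`` with fopen(), call
``tls_stat_lookup(f, "TlsDecryptError", &count)`` and compare the result
against the previous sample to detect new decryption failures.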