.. _kernel_tls:

==========
Kernel TLS
==========

Overview
========

Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs over
TCP. TLS provides end-to-end data integrity and confidentiality.

User interface
==============

Creating a TLS connection
-------------------------

First create a new TCP socket and set the TLS ULP.

.. code-block:: c

  sock = socket(AF_INET, SOCK_STREAM, 0);
  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

Setting the TLS ULP allows us to set/get TLS socket options. Currently
only the symmetric encryption is handled in the kernel. After the TLS
handshake is complete, we have all the parameters required to move the
data-path to the kernel. There are separate socket options for moving
the transmit path and the receive path into the kernel.

.. code-block:: c

  /* From linux/tls.h */
  struct tls_crypto_info {
          unsigned short version;
          unsigned short cipher_type;
  };

  struct tls12_crypto_info_aes_gcm_128 {
          struct tls_crypto_info info;
          unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
          unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
          unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
          unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
  };


  struct tls12_crypto_info_aes_gcm_128 crypto_info;

  crypto_info.info.version = TLS_1_2_VERSION;
  crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
  memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE);
  memcpy(crypto_info.rec_seq, seq_number_write,
         TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
  memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE);

  setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));

Transmit and receive are set separately, but the setup is the same, using either
TLS_TX or TLS_RX.
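
Because both directions take the same structure, the setup can be wrapped
in a pair of small helpers. The following is a minimal sketch, assuming
TLS 1.2 with AES-GCM-128 and key material already produced by a userspace
handshake. ``ktls_fill_aes_gcm_128()`` and ``ktls_install()`` are
hypothetical names used for illustration, not part of any kernel or
library API.

.. code-block:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* fallback for libc headers that lack it */
  #endif

  /* Hypothetical helper: copy the key material for one direction. */
  static void ktls_fill_aes_gcm_128(struct tls12_crypto_info_aes_gcm_128 *ci,
                                    const unsigned char *key,
                                    const unsigned char *salt,
                                    const unsigned char *iv,
                                    const unsigned char *rec_seq)
  {
          memset(ci, 0, sizeof(*ci));
          ci->info.version = TLS_1_2_VERSION;
          ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
          memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
          memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
          memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
          memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  }

  /*
   * Hypothetical helper: direction is TLS_TX or TLS_RX.
   * Returns 0 on success, -1 with errno set on failure.
   */
  static int ktls_install(int sock, int direction,
                          const struct tls12_crypto_info_aes_gcm_128 *ci)
  {
          return setsockopt(sock, SOL_TLS, direction, ci, sizeof(*ci));
  }

A caller would fill one structure with the write keys and install it with
``TLS_TX``, then fill a second one with the read keys and install it with
``TLS_RX``.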

Sending TLS application data
----------------------------

After setting the TLS_TX socket option all application data sent over this
socket is encrypted using TLS and the parameters provided in the socket option.
For example, we can send an encrypted hello world record as follows:

.. code-block:: c

  const char *msg = "hello world\n";
  send(sock, msg, strlen(msg), 0);

send() data is directly encrypted from the userspace buffer provided
to the encrypted kernel send buffer if possible.

The sendfile system call will send the file's data over TLS records of maximum
length (2^14).

.. code-block:: c

  file = open(filename, O_RDONLY);
  fstat(file, &stat);
  sendfile(sock, file, &offset, stat.st_size);

TLS records are created and sent after each send() call, unless
MSG_MORE is passed. MSG_MORE will delay creation of a record until
MSG_MORE is not passed, or the maximum record size is reached.

The kernel will need to allocate a buffer for the encrypted data.
This buffer is allocated at the time send() is called, such that
either the entire send() call will return -ENOMEM (or block waiting
for memory), or the encryption will always succeed. If send() returns
-ENOMEM and some data was left on the socket buffer from a previous
call using MSG_MORE, the MSG_MORE data is left on the socket buffer.

Receiving TLS application data
------------------------------

After setting the TLS_RX socket option, data returned by all recv family
socket calls is decrypted using the TLS parameters provided. A full TLS
record must be received before decryption can happen.

.. code-block:: c

  char buffer[16384];
  recv(sock, buffer, sizeof(buffer), 0);

Received data is decrypted directly into the user buffer if it is
large enough, and no additional allocations occur. If the userspace
buffer is too small, data is decrypted in the kernel and copied to
userspace.
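
As an illustration, a receive helper can retry on ``EINTR`` while the
caller passes a buffer large enough for a maximum-size record (2^14
bytes), so that the direct-decryption path described above can apply.
This is only a sketch; ``ktls_recv_record()`` and ``KTLS_MAX_PLAINTEXT``
are hypothetical names.

.. code-block:: c

  #include <errno.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  #define KTLS_MAX_PLAINTEXT 16384 /* 2^14, maximum record payload */

  /*
   * Hypothetical helper: receive decrypted data into buf, retrying if
   * the call is interrupted by a signal. Returns the number of bytes
   * received, 0 on orderly shutdown, or -1 with errno set.
   */
  static ssize_t ktls_recv_record(int sock, void *buf, size_t len)
  {
          ssize_t n;

          do {
                  n = recv(sock, buf, len, 0);
          } while (n < 0 && errno == EINTR);

          return n;
  }

A caller would typically declare ``char buf[KTLS_MAX_PLAINTEXT];`` and
pass ``sizeof(buf)`` as the length.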

``EINVAL`` is returned if the TLS version in the received message does not
match the version passed in setsockopt.

``EMSGSIZE`` is returned if the received message is too big.

``EBADMSG`` is returned if decryption failed for any other reason.

Send TLS control messages
-------------------------

Other than application data, TLS has control messages such as alert
messages (record type 21) and handshake messages (record type 22), etc.
These messages can be sent over the socket by providing the TLS record type
via a CMSG. For example the following function sends @data of @length bytes
using a record of type @record_type.

.. code-block:: c

  /* send TLS control message using record_type */
  static int ktls_send_ctrl_message(int sock, unsigned char record_type,
                                    void *data, size_t length)
  {
        struct msghdr msg = {0};
        int cmsg_len = sizeof(record_type);
        struct cmsghdr *cmsg;
        char buf[CMSG_SPACE(cmsg_len)];
        struct iovec msg_iov;   /* Vector of data to send/receive into.  */

        msg.msg_control = buf;
        msg.msg_controllen = sizeof(buf);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_TLS;
        cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
        cmsg->cmsg_len = CMSG_LEN(cmsg_len);
        *CMSG_DATA(cmsg) = record_type;
        msg.msg_controllen = cmsg->cmsg_len;

        msg_iov.iov_base = data;
        msg_iov.iov_len = length;
        msg.msg_iov = &msg_iov;
        msg.msg_iovlen = 1;

        return sendmsg(sock, &msg, 0);
  }

Control message data should be provided unencrypted, and will be
encrypted by the kernel.

Receiving TLS control messages
------------------------------

TLS control messages are passed in the userspace buffer, with message
type passed via cmsg. If no cmsg buffer is provided, an error is
returned if a control message is received. Data messages may be
received without a cmsg buffer set.

.. code-block:: c

  char buffer[16384];
  char cmsg_buf[CMSG_SPACE(sizeof(unsigned char))];
  struct msghdr msg = {0};
  msg.msg_control = cmsg_buf;
  msg.msg_controllen = sizeof(cmsg_buf);

  struct iovec msg_iov;
  msg_iov.iov_base = buffer;
  msg_iov.iov_len = sizeof(buffer);

  msg.msg_iov = &msg_iov;
  msg.msg_iovlen = 1;

  int ret = recvmsg(sock, &msg, 0 /* flags */);

  /* cmsg is NULL when no control data was returned. */
  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg && cmsg->cmsg_level == SOL_TLS &&
      cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
      int record_type = *((unsigned char *)CMSG_DATA(cmsg));
      // Do something with record_type, and control message data in
      // buffer.
      //
      // Note that record_type may be == to application data (23).
  } else {
      // Buffer contains application data.
  }

A single recv call never returns data from TLS records of different types.

TLS 1.3 Key Updates
-------------------

In TLS 1.3, KeyUpdate handshake messages signal that the sender is
updating its TX key. Any message sent after a KeyUpdate will be
encrypted using the new key. The userspace library can pass the new
key to the kernel using the TLS_TX and TLS_RX socket options, as for
the initial keys. TLS version and cipher cannot be changed.

To prevent attempting to decrypt incoming records using the wrong key,
decryption will be paused when a KeyUpdate message is received by the
kernel, until the new key has been provided using the TLS_RX socket
option. Any read occurring after the KeyUpdate has been read and
before the new key is provided will fail with EKEYEXPIRED. poll() will
not report any read events from the socket until the new key is
provided. There is no pausing on the transmit side.

Userspace should make sure that the crypto_info provided has been set
properly. In particular, the kernel will not check for key/nonce
reuse.
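
The receive-side behavior can be sketched as follows, assuming a TLS 1.3
connection using AES-GCM-128 whose next RX key has already been derived
by the userspace library. ``ktls_rx_needs_rekey()`` and
``ktls_rekey_rx()`` are hypothetical helper names, and deriving the new
key material is outside the scope of this sketch.

.. code-block:: c

  #include <errno.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* fallback for libc headers that lack it */
  #endif

  /*
   * Hypothetical helper: true if a failed read means the kernel has
   * paused decryption and is waiting for a new RX key.
   */
  static int ktls_rx_needs_rekey(ssize_t ret, int err)
  {
          return ret < 0 && err == EKEYEXPIRED;
  }

  /*
   * Hypothetical helper: install the next RX key. new_ci must carry the
   * same TLS version and cipher as the key it replaces. The kernel does
   * not check for key/nonce reuse, so the caller is responsible for the
   * contents of new_ci.
   */
  static int ktls_rekey_rx(int sock,
                           const struct tls12_crypto_info_aes_gcm_128 *new_ci)
  {
          return setsockopt(sock, SOL_TLS, TLS_RX, new_ci, sizeof(*new_ci));
  }

A caller would check ``ktls_rx_needs_rekey(ret, errno)`` after a failed
read, install the new key with ``ktls_rekey_rx()``, and then retry the
read.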

The number of successful and failed key updates is tracked in the
``TlsTxRekeyOk``, ``TlsRxRekeyOk``, ``TlsTxRekeyError``,
``TlsRxRekeyError`` statistics. The ``TlsRxRekeyReceived`` statistic
counts KeyUpdate handshake messages that have been received.

Integrating into userspace TLS library
--------------------------------------

At a high level, the kernel TLS ULP is a replacement for the record
layer of a userspace TLS library.

A patchset to OpenSSL to use ktls as the record layer is
`here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_.

`An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_
of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.

Optional optimizations
----------------------

There are certain condition-specific optimizations the TLS ULP can make,
if requested. Those optimizations are either not universally beneficial
or may impact correctness, hence they require an opt-in.
All options are set per-socket using setsockopt(), and their
state can be checked using getsockopt() and via socket diag (``ss``).

TLS_TX_ZEROCOPY_RO
~~~~~~~~~~~~~~~~~~

For device offload only. Allow sendfile() data to be transmitted directly
to the NIC without making an in-kernel copy. This allows true zero-copy
behavior when device offload is enabled.

The application must make sure that the data is not modified between being
submitted and transmission completing. In other words this is mostly
applicable if the data sent on a socket via sendfile() is read-only.

Modifying the data may result in different versions of the data being used
for the original TCP transmission and TCP retransmissions. To the receiver
this will look as if TLS records had been tampered with and will result
in record authentication failures.
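
Opting in could look like the sketch below. It assumes the option takes
an unsigned integer flag (1 to enable, 0 to disable);
``ktls_set_zerocopy_ro()`` is a hypothetical helper name, and the
fallback definition of ``TLS_TX_ZEROCOPY_RO`` is an assumption intended
only for building against older uapi headers.

.. code-block:: c

  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* fallback for libc headers that lack it */
  #endif
  #ifndef TLS_TX_ZEROCOPY_RO
  #define TLS_TX_ZEROCOPY_RO 3 /* assumed value for older uapi headers */
  #endif

  /*
   * Hypothetical helper: enable or disable zero-copy sendfile() on a
   * socket that already has the TLS ULP and TLS_TX configured.
   * Returns 0 on success, -1 with errno set on failure.
   */
  static int ktls_set_zerocopy_ro(int sock, int enable)
  {
          unsigned int val = enable ? 1 : 0;

          return setsockopt(sock, SOL_TLS, TLS_TX_ZEROCOPY_RO,
                            &val, sizeof(val));
  }

The file passed to sendfile() must then stay unmodified until
transmission completes, as described above.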

TLS_RX_EXPECT_NO_PAD
~~~~~~~~~~~~~~~~~~~~

TLS 1.3 only. Expect the sender to not pad records. This allows the data
to be decrypted directly into user space buffers with TLS 1.3.

This optimization is safe to enable only if the remote end is trusted,
otherwise it is an attack vector for doubling the TLS processing cost.

If the decrypted record turns out to have been padded or is not a data
record, it will be decrypted again into a kernel buffer without zero copy.
Such events are counted in the ``TlsDecryptRetry`` statistic.

Statistics
==========

The TLS implementation exposes the following per-namespace statistics
(``/proc/net/tls_stat``):

- ``TlsCurrTxSw``, ``TlsCurrRxSw`` -
  number of TX and RX sessions currently installed where host handles
  cryptography

- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` -
  number of TX and RX sessions currently installed where NIC handles
  cryptography

- ``TlsTxSw``, ``TlsRxSw`` -
  number of TX and RX sessions opened with host cryptography

- ``TlsTxDevice``, ``TlsRxDevice`` -
  number of TX and RX sessions opened with NIC cryptography

- ``TlsDecryptError`` -
  record decryption failed (e.g. due to incorrect authentication tag)

- ``TlsDeviceRxResync`` -
  number of RX resyncs sent to NICs handling cryptography

- ``TlsDecryptRetry`` -
  number of RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
  also increment for non-data records.

- ``TlsRxNoPadViolation`` -
  number of data RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.

- ``TlsTxRekeyOk``, ``TlsRxRekeyOk`` -
  number of successful rekeys on existing sessions for TX and RX

- ``TlsTxRekeyError``, ``TlsRxRekeyError`` -
  number of failed rekeys on existing sessions for TX and RX

- ``TlsRxRekeyReceived`` -
  number of received KeyUpdate handshake messages, requiring userspace
  to provide a new RX key
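
A monitoring tool could read individual counters with a small parser
such as the sketch below. It assumes each line of ``/proc/net/tls_stat``
consists of a counter name followed by whitespace and a decimal value;
``tls_stat_lookup()`` is a hypothetical helper name.

.. code-block:: c

  #include <stdio.h>
  #include <string.h>

  /*
   * Hypothetical helper: scan "Name value" lines from f and look up one
   * counter by name. Returns 0 and stores the value on success, or -1
   * if the counter was not found.
   */
  static int tls_stat_lookup(FILE *f, const char *name,
                             unsigned long long *value)
  {
          char key[64];
          unsigned long long v;

          while (fscanf(f, "%63s %llu", key, &v) == 2) {
                  if (strcmp(key, name) == 0) {
                          *value = v;
                          return 0;
                  }
          }
          return -1;
  }

In use, the stream would come from ``fopen("/proc/net/tls_stat", "r")``
and the name would be one of the counters listed above, for example
``"TlsDecryptError"``.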