#
c5c3ba6b |
| 03-Sep-2019 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r351317 through r351731.
|
#
b2e60773 |
| 27-Aug-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotiation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type.
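The framing described above can be illustrated with the 5-byte TLS record header (content type, protocol version, payload length) defined by the TLS specification. This is a minimal userland sketch, not kernel code: `tls_write_header` and the `TLS_CT_*` names are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* TLS record content types as defined by the TLS specification. */
enum tls_content_type {
	TLS_CT_CHANGE_CIPHER_SPEC = 20,
	TLS_CT_ALERT = 21,
	TLS_CT_HANDSHAKE = 22,
	TLS_CT_APPLICATION_DATA = 23
};

#define TLS_HEADER_LEN 5

/*
 * Hypothetical helper (not a kernel function): write the 5-byte TLS
 * record header for a payload of payload_len bytes into hdr.
 * version is the wire value, e.g. 0x0303 for TLS 1.2.
 */
void
tls_write_header(uint8_t hdr[TLS_HEADER_LEN], uint8_t type,
    uint16_t version, uint16_t payload_len)
{
	assert(hdr != NULL);
	hdr[0] = type;                          /* record type */
	hdr[1] = (uint8_t)(version >> 8);       /* version, big-endian */
	hdr[2] = (uint8_t)(version & 0xff);
	hdr[3] = (uint8_t)(payload_len >> 8);   /* length, big-endian */
	hdr[4] = (uint8_t)(payload_len & 0xff);
}
```

Data written with write(2) would correspond to `TLS_CT_APPLICATION_DATA` here, while a record sent with a custom type would simply carry a different first byte.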
At present, rekeying is not supported though the in-kernel framework should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via M_NOTREADY, similar to sendfile(2), when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the sendfile case, sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY flag is set when invoking pru_send() along with invoking ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.)
KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs.
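The not-ready/ready life cycle of a software TLS frame can be modelled in a few lines. This is a toy userland sketch under stated assumptions: `struct frame`, `frame_enqueue`, `frame_encrypt` and `sendable_frames` are hypothetical names, the XOR "cipher" is a placeholder rather than real encryption, and the rule that transmission stops at the first unready frame is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TLS frame sitting in a socket buffer. */
struct frame {
	uint8_t data[16];
	size_t  len;
	bool    not_ready;      /* models M_NOTREADY */
};

/* Models ktls_frame() + ktls_enqueue(): the frame is queued but hidden. */
void
frame_enqueue(struct frame *f)
{
	f->not_ready = true;
}

/* Models a worker thread: "encrypt" in place, then mark the frame ready. */
void
frame_encrypt(struct frame *f, uint8_t key)
{
	assert(f->len <= sizeof(f->data));
	for (size_t i = 0; i < f->len; i++)
		f->data[i] ^= key;      /* placeholder, not real crypto */
	f->not_ready = false;           /* models pru_ready() */
}

/* Count the leading frames that may be sent: stop at the first unready one. */
size_t
sendable_frames(const struct frame *q, size_t n)
{
	size_t i = 0;

	while (i < n && !q[i].not_ready)
		i++;
	return (i);
}
```

In this model a frame encrypted out of order stays invisible to the sender until every frame ahead of it is ready as well, which keeps the byte stream in order.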
ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.)
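The send tag decision described above reduces to a small state check. The sketch below is a simplified model, not the code in ip_output_send(): `struct toy_snd_tag`, `tls_output_check` and `tls_refresh_tag` are hypothetical names, and an interface is represented by a bare integer index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tls_tx_action {
	TLS_TX_SEND_TAGGED,             /* tag matches: hand to the NIC */
	TLS_TX_DROP_AND_REFRESH         /* mismatch: drop, schedule refresh */
};

/* Hypothetical stand-in for a TLS send tag bound to one interface. */
struct toy_snd_tag {
	int ifindex;
};

/* Decide what happens to one outbound TLS record. */
enum tls_tx_action
tls_output_check(const struct toy_snd_tag *tag, int out_ifindex)
{
	assert(tag != NULL);
	if (tag->ifindex == out_ifindex)
		return (TLS_TX_SEND_TAGGED);
	return (TLS_TX_DROP_AND_REFRESH);
}

/*
 * Models the refresh task: returns false when no new tag could be
 * allocated (the connection is dropped), true when the session now
 * carries a tag for the new interface.
 */
bool
tls_refresh_tag(struct toy_snd_tag *tag, int new_ifindex, bool alloc_ok)
{
	if (!alloc_ok)
		return (false);
	tag->ifindex = new_ifindex;
	return (true);
}
```

The lagg failover test mentioned in the commit corresponds to one `TLS_TX_DROP_AND_REFRESH` result followed by a successful refresh, after which records are tagged for the new port.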
ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
|
#
0ecd976e |
| 02-Aug-2019 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now, so it is feasible to make the change for everything remaining without causing too much disturbance.
Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD.
* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6.
The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible, and that people are less likely to overlook one or the other protocol family when making code changes, since different names for the same thing make such places easy to miss.
No functional changes.
Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix
|
#
a63915c2 |
| 28-Jul-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @r350386
Sponsored by: The FreeBSD Foundation
|
Revision tags: release/11.3.0 |
|
#
82334850 |
| 29-Jun-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe.
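To show why accessors such as m_copydata() need page-aware logic, here is a simplified userland model of a multi-page external buffer and a copy routine that walks it. Everything in it is an assumption made for illustration: the `toy_ext_pgs` layout, its field names and the tiny `PG_SIZE` are hypothetical, and ordinary pointers stand in for the physical addresses the real structure stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PG_SIZE 8       /* tiny page size, for illustration only */
#define MAX_PGS 4

/* Hypothetical, simplified model of a multi-page external buffer. */
struct toy_ext_pgs {
	const uint8_t *pg[MAX_PGS];     /* stands in for physical addresses */
	int     npgs;
	size_t  first_off;              /* offset of data in the first page */
	size_t  last_len;               /* valid bytes in the last page */
};

/* Offset of the valid data within page i. */
static size_t
pg_off(const struct toy_ext_pgs *e, int i)
{
	return (i == 0 ? e->first_off : 0);
}

/* Number of valid bytes in page i. */
static size_t
pg_len(const struct toy_ext_pgs *e, int i)
{
	if (i == e->npgs - 1)
		return (e->last_len);
	if (i == 0)
		return (PG_SIZE - e->first_off);
	return (PG_SIZE);
}

/* Total payload length across all pages. */
size_t
toy_ext_pgs_len(const struct toy_ext_pgs *e)
{
	size_t total = 0;

	for (int i = 0; i < e->npgs; i++)
		total += pg_len(e, i);
	return (total);
}

/* Copy len bytes starting at payload offset off into dst. */
void
toy_ext_pgs_copydata(const struct toy_ext_pgs *e, size_t off, size_t len,
    uint8_t *dst)
{
	for (int i = 0; i < e->npgs && len > 0; i++) {
		size_t plen = pg_len(e, i);
		size_t n;

		if (off >= plen) {      /* range starts in a later page */
			off -= plen;
			continue;
		}
		n = plen - off;
		if (n > len)
			n = len;
		memcpy(dst, e->pg[i] + pg_off(e, i) + off, n);
		dst += n;
		len -= n;
		off = 0;
	}
	assert(len == 0);       /* caller must stay within the payload */
}
```

The kernel cannot dereference the pages directly as this model does; per the commit it goes through uiomove_fromphys() instead, but the page-walking arithmetic has the same shape.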
NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required.
If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum.
Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616
|
#
7648bc9f |
| 13-May-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @347527
Sponsored by: The FreeBSD Foundation
|
#
68cea2b1 |
| 19-Apr-2019 |
John Baldwin <jhb@FreeBSD.org> |
Push down INP_WLOCK slightly in tcp_ctloutput.
The inp lock is not needed for testing the V6 flag as that flag is set once when the inp is created and never changes. For non-TCP socket options the lock is immediately dropped after checking that flag. This just pushes the lock down to only be acquired for TCP socket options.
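The effect of pushing the lock down can be shown with an instrumented toy lock. This is a sketch of the pattern only, not the tcp_ctloutput() source: `struct toy_inp`, `toy_ctloutput` and the handler enum are hypothetical, and the lock is a plain flag with a counter rather than a real rwlock.

```c
#include <assert.h>
#include <stdbool.h>

#define LEVEL_TCP 6     /* numeric value of IPPROTO_TCP */

enum handler { HANDLED_BY_IP, HANDLED_BY_IP6, HANDLED_BY_TCP };

/* Hypothetical stand-in for an inpcb with an instrumented lock. */
struct toy_inp {
	bool is_v6;             /* set once at creation, never changes */
	bool locked;
	int  lock_acquisitions;
};

void
toy_wlock(struct toy_inp *inp)
{
	assert(!inp->locked);
	inp->locked = true;
	inp->lock_acquisitions++;
}

void
toy_wunlock(struct toy_inp *inp)
{
	assert(inp->locked);
	inp->locked = false;
}

enum handler
toy_ctloutput(struct toy_inp *inp, int level)
{
	if (level != LEVEL_TCP) {
		/* Immutable flag: safe to read without taking the lock. */
		return (inp->is_v6 ? HANDLED_BY_IP6 : HANDLED_BY_IP);
	}
	toy_wlock(inp);
	/* ... TCP-level option handling would go here ... */
	toy_wunlock(inp);
	return (HANDLED_BY_TCP);
}
```

Non-TCP options now return without ever touching the lock, which is the whole of the change.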
This isn't a hot-path, more a cosmetic cleanup I noticed while reading the code.
Reviewed by: bz MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19740
|
#
67350cb5 |
| 09-Dec-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r340918 through r341763.
|
Revision tags: release/12.0.0 |
|
#
c8b53ced |
| 30-Nov-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Limit option_len for the TCP_CCALGOOPT.
Limiting the length to 2048 bytes seems to be acceptable, since the values used right now are using 8 bytes.
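A bound like this is typically enforced before any buffer is allocated for the option value. The sketch below assumes that shape: `TOY_CCALGOOPT_MAX` and `check_ccalgoopt_len` are hypothetical names, and treating 2048 as an inclusive maximum and returning EINVAL are assumptions rather than details stated in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed inclusive upper bound, taken from the commit message. */
#define TOY_CCALGOOPT_MAX 2048

/*
 * Hypothetical validation helper: reject oversized option lengths
 * before any allocation or copyin would take place.
 */
int
check_ccalgoopt_len(size_t optlen)
{
	if (optlen > TOY_CCALGOOPT_MAX)
		return (EINVAL);
	return (0);
}
```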
Reviewed by: glebius, bz, rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18366
|
#
7847e041 |
| 24-Aug-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r338026 through r338297, and resolve conflicts.
|
#
c6c0be27 |
| 24-Aug-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix a shadowed variable warning. Thanks to Peter Lei for reporting the issue.
Approved by: re(kib@) MFH: 1 month Sponsored by: Netflix, Inc.
|
#
5dff1c38 |
| 21-Aug-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP socket resulted in sending fragmented IPV6 packets.
This is fixed by reducing the MSS to the appropriate value. In addition, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not strictly required, but done since TCP is conservative.
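The "appropriate value" follows from simple arithmetic: the IPv6 minimum MTU is 1280 bytes, the fixed IPv6 header takes 40 and the basic TCP header takes 20, leaving 1220 bytes. The sketch below is a minimal model of that clamp; `tcp6_clamp_mss` and the constant names are hypothetical, and it ignores TCP options and IPv6 extension headers, which would reduce the value further.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_IPV6_MIN_MTU 1280   /* IPv6 minimum link MTU */
#define TOY_IPV6_HDR_LEN 40     /* fixed IPv6 header */
#define TOY_TCP_HDR_LEN  20     /* TCP header without options */

/*
 * Hypothetical helper: clamp the MSS so that a full segment fits in a
 * minimum-MTU IPv6 packet when IPV6_USE_MIN_MTU is in effect.
 */
unsigned
tcp6_clamp_mss(unsigned mss, bool use_min_mtu)
{
	const unsigned min_mtu_mss =
	    TOY_IPV6_MIN_MTU - TOY_IPV6_HDR_LEN - TOY_TCP_HDR_LEN;

	if (use_min_mtu && mss > min_mtu_mss)
		return (min_mtu_mss);
	return (mss);
}
```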
PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796
|
#
c28440db |
| 20-Aug-2018 |
Randall Stewart <rrs@FreeBSD.org> |
This change represents a substantial restructure of the way we reassemble inbound TCP segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments, which in a high BDP situation will cause us to be a lot more inefficient as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments, putting an emphasis on the two common cases (which avoid walking the list of segments), i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines.
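The coalescing idea, with its two fast paths, can be sketched over a small sorted array of byte ranges. This is an illustrative model rather than the kernel's reassembly code: `struct toy_reass` and `reass_insert` are hypothetical, ranges are half-open `[start, end)`, no payload is stored, and sequence number wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAXSEG 8

struct toy_range { uint32_t start, end; };      /* [start, end) */
struct toy_reass { struct toy_range r[MAXSEG]; int n; };

/* Insert [start, end), coalescing with neighbours. Returns -1 if full. */
int
reass_insert(struct toy_reass *q, uint32_t start, uint32_t end)
{
	int i, j;

	assert(start < end);
	/* Fast path 1: the segment extends the tail of the queue. */
	if (q->n > 0 && start == q->r[q->n - 1].end) {
		q->r[q->n - 1].end = end;
		return (0);
	}
	/* Fast path 2: the segment extends the head of the queue. */
	if (q->n > 0 && end == q->r[0].start) {
		q->r[0].start = start;
		return (0);
	}
	/* Slow path: skip ranges that end strictly before the segment. */
	for (i = 0; i < q->n && q->r[i].end < start; i++)
		;
	/* Absorb every range that overlaps or touches the segment. */
	for (j = i; j < q->n && q->r[j].start <= end; j++) {
		if (q->r[j].start < start)
			start = q->r[j].start;
		if (q->r[j].end > end)
			end = q->r[j].end;
	}
	if (j == i) {
		/* Nothing to merge with: open a new slot at position i. */
		if (q->n == MAXSEG)
			return (-1);
		memmove(&q->r[i + 1], &q->r[i],
		    (size_t)(q->n - i) * sizeof(q->r[0]));
		q->n++;
	} else {
		/* Ranges i..j-1 collapse into the single slot i. */
		memmove(&q->r[i + 1], &q->r[j],
		    (size_t)(q->n - j) * sizeof(q->r[0]));
		q->n -= j - i - 1;
	}
	q->r[i].start = start;
	q->r[i].end = end;
	return (0);
}
```

With in-order or nearly in-order arrival, almost every insert hits one of the two fast paths and the entry count stays small, which is what keeps a fixed entry limit from hurting high-BDP connections.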
Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626
|
#
8e02b4e0 |
| 19-Aug-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Don't expose the uptime via the TCP timestamps.
The TCP client side or the TCP server side when not using SYN-cookies used the uptime as the TCP timestamp value. This patch uses in all cases an offset, which is the result of a keyed hash function taking the source and destination addresses and port numbers into account. The keyed hash function is the same as used for the initial TSN.
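The offset scheme can be sketched as follows. This is a toy model under loud assumptions: `toy_keyed_hash` uses FNV-1a purely as a stand-in and is not the keyed hash the kernel uses, addresses are reduced to 32-bit integers, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the low nbytes of v into an FNV-1a running hash. */
static uint32_t
fnv1a_mix(uint32_t h, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++) {
		h ^= (uint8_t)(v >> (8 * i));
		h *= 16777619u;         /* FNV-1a 32-bit prime */
	}
	return (h);
}

/*
 * Stand-in keyed hash over the connection 4-tuple. FNV-1a is NOT
 * cryptographically secure and is used here only for illustration.
 */
uint32_t
toy_keyed_hash(uint64_t key, uint32_t src, uint32_t dst,
    uint16_t sport, uint16_t dport)
{
	uint32_t h = 2166136261u;       /* FNV-1a 32-bit offset basis */

	h = fnv1a_mix(h, key, 8);
	h = fnv1a_mix(h, src, 4);
	h = fnv1a_mix(h, dst, 4);
	h = fnv1a_mix(h, sport, 2);
	h = fnv1a_mix(h, dport, 2);
	return (h);
}

/* Timestamp value sent on the wire: uptime ticks plus per-flow offset. */
uint32_t
toy_ts_value(uint32_t uptime_ticks, uint32_t offset)
{
	return (uptime_ticks + offset);
}
```

Because the offset is constant for a given connection, differences between timestamps still measure elapsed time, while the absolute value no longer reveals the uptime to a peer that does not know the key.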
Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16636
|
#
14b841d4 |
| 11-Aug-2018 |
Kyle Evans <kevans@FreeBSD.org> |
MFH @ r337607, in preparation for boarding
|
#
bbd7a929 |
| 04-Aug-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r336870 through r337285, and resolve conflicts.
|
#
51e08d53 |
| 31-Jul-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix INET only builds.
r336940 introduced an "unused variable" warning on platforms which support INET, but not INET6, like MALTA and MALTA64 as reported by Mark Millard. Improve the #ifdefs to address this issue.
Sponsored by: Netflix, Inc.
|
#
888973f5 |
| 30-Jul-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Allow implicit TCP connection setup for TCP/IPv6.
TCP/IPv4 allows an implicit connection setup using sendto(), which is used for TTCP and TCP fast open. This patch adds support for TCP/IPv6. While there, improve some tests for detecting multicast addresses, which are mapped.
Reviewed by: bz@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16458
|
#
8db239dc |
| 30-Jul-2018 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled calls accept(), recv(), send(), and close() before the TCP-ACK segment has been received, the TCP connection is just dropped and the reception of the TCP-ACK segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled calls accept(), recv(), send(), send(), and close() before the TCP-ACK segment has been received, the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed by close(), the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16485
|
#
22699887 |
| 22-Jul-2018 |
Matt Macy <mmacy@FreeBSD.org> |
NULL out cc_data in pluggable TCP {cc}_cb_destroy
When ABE was added (rS331214) to NewReno and the leak was fixed (rS333699), NewReno gained a destructor (newreno_cb_destroy) for per connection state. Other congestion controls may allocate and free cc_data on entry and exit, but the field is never explicitly NULLed when moving back to NewReno, which only internally allocates stateful data (no entry constructor), resulting in a situation where newreno_cb_destroy might be called on a junk pointer.
- NULL out cc_data in the framework after calling {cc}_cb_destroy
- free(9) checks for NULL so there is no need to perform not NULL checks before calling free.
- Improve a comment about NewReno in tcp_ccalgounload
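The fix pattern is small enough to sketch in userland. The types below are simplified stand-ins, not the kernel's congestion control structures: `toy_cc_var`, `toy_cc_algo` and `toy_cc_cleanup` are hypothetical, and userland free(3) is used where the kernel would call free(9); both accept NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for per-connection CC state and an algorithm. */
struct toy_cc_var {
	void *cc_data;
};

struct toy_cc_algo {
	void (*cb_destroy)(struct toy_cc_var *);
};

/* A destructor that frees per-connection state; free(NULL) is a no-op. */
void
toy_newreno_cb_destroy(struct toy_cc_var *ccv)
{
	free(ccv->cc_data);
}

/*
 * Framework-side cleanup: run the algorithm's destructor, then clear
 * the pointer so a later destructor never sees a stale value.
 */
void
toy_cc_cleanup(const struct toy_cc_algo *algo, struct toy_cc_var *ccv)
{
	if (algo->cb_destroy != NULL)
		algo->cb_destroy(ccv);
	ccv->cc_data = NULL;
}
```

Clearing the pointer in one place in the framework means individual algorithms do not each have to remember to do it, and a second cleanup becomes harmless.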
This is the result of a debugging session from Jason Wolfe, Jason Eggleston, and mmacy@ and very helpful insight from lstewart@.
Submitted by: Kevin Bowling Reviewed by: lstewart Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16282
|
#
6573d758 |
| 04-Jul-2018 |
Matt Macy <mmacy@FreeBSD.org> |
epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument; there's no longer any benefit to dropping the pcbinfo lock, and trying to do so just adds an error prone branchfest to these functions
- Remove cases of same function recursion on the epoch as recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate.
Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
|
Revision tags: release/11.2.0 |
|
#
fd389e7c |
| 19-Apr-2018 |
Randall Stewart <rrs@FreeBSD.org> |
These two modules need the tcp_hpts.h file for when the option is enabled (not sure how LINT/build-universe missed this). Oops.
Sponsored by: Netflix Inc
|
#
3ee9c3c4 |
| 19-Apr-2018 |
Randall Stewart <rrs@FreeBSD.org> |
This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR, which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must ensure that the base component is compiled into the kernel in which they are loaded.
MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020
|
#
8fa799bd |
| 06-Apr-2018 |
Jonathan T. Looney <jtl@FreeBSD.org> |
If a user closes the socket before we call tcp_usr_abort(), then tcp_drop() may unlock the INP. Currently, tcp_usr_abort() does not check for this case, which results in a panic while trying to unlock the already-unlocked INP (not to mention, a use-after-free violation).
Make tcp_usr_abort() check the return value of tcp_drop(). In the case where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps to abort the connection and simply unlock the INP_INFO lock prior to returning.
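The contract described above, where a NULL return means the callee already released the lock, can be modelled as follows. This is a toy sketch, not the tcp_usr_abort() source: `struct toy_conn`, `toy_drop` and `toy_abort` are hypothetical, and nothing is actually freed so that the state remains inspectable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a locked connection control block. */
struct toy_conn {
	bool locked;
	bool socket_closed;
};

/*
 * Models tcp_drop(): returns NULL when the teardown path already
 * released the lock (and, in the kernel, possibly freed the object).
 */
struct toy_conn *
toy_drop(struct toy_conn *c)
{
	if (c->socket_closed) {
		c->locked = false;
		return (NULL);
	}
	return (c);
}

/* Models the fixed caller. Returns how many unlocks the caller did. */
int
toy_abort(struct toy_conn *c)
{
	struct toy_conn *r;

	c->locked = true;
	r = toy_drop(c);
	if (r == NULL)
		return (0);     /* already unlocked: must not touch c */
	assert(r->locked);
	r->locked = false;
	return (1);
}
```

The buggy version corresponds to ignoring the return value and unlocking unconditionally, which in this model would trip an unlock of an already-unlocked object.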
Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix, Inc.
|
#
c73b6f4d |
| 04-Apr-2018 |
Ed Maste <emaste@FreeBSD.org> |
Fix kernel memory disclosure in tcp_ctloutput
strcpy was used to copy a string into a buffer copied to userland, which left uninitialized data after the terminating 0-byte. Use the same approach as in tcp_subr.c: strncpy and explicit '\0'.
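The difference between the two calls is that strncpy pads the remainder of the destination with NUL bytes when the source is shorter, whereas strcpy leaves whatever was in the buffer after the terminator. A minimal userland sketch of the fixed pattern, with the hypothetical names `export_name` and `ALGO_NAME_LEN`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALGO_NAME_LEN 16        /* hypothetical fixed buffer size */

/*
 * Fill a fixed-size buffer that will be exported in full. strncpy
 * zero-pads the unused tail, so no stale bytes remain after the
 * terminator; the last byte is set explicitly in case name is long.
 */
void
export_name(char buf[ALGO_NAME_LEN], const char *name)
{
	assert(buf != NULL && name != NULL);
	strncpy(buf, name, ALGO_NAME_LEN - 1);
	buf[ALGO_NAME_LEN - 1] = '\0';
}
```

Pre-filling the buffer with a recognizable pattern and then inspecting every byte after the call is an easy way to confirm that nothing stale survives.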
admbugs: 765, 822 MFC after: 1 day Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reported by: Vlad Tsyrklevich Security: Kernel memory disclosure Sponsored by: The FreeBSD Foundation
|