#
000c42fa |
| 03-Mar-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
ip6_output: use new routing KPI when not passed a cached route
Implement the equivalent of r347375 (IPv4) for the IPv6 output path. In IPv6 we get passed a cached route (and inp) by udp6_output() de
ip6_output: use new routing KPI when not passed a cached route
Implement the equivalent of r347375 (IPv4) for the IPv6 output path. In IPv6 we get passed a cached route (and inp) by udp6_output() depending on whether we acquired a write lock on the INP. In case we neither bind nor connect a first UDP packet would come in with a cached route (wlocked) and all further packets would not. In case we bind and do not connect we never write-lock the inp.
When we do not pass in a cached route, rather than providing the storage for a route locally and pass it over the old lookup code and down the stack, use the new route lookup KPI and acquire all details we need to send the packet.
Compared to the IPv4 code the IPv6 code has a couple of possible complications: given an option with a routing hdr/caching route there, and path mtu (ro_pmtu) case which now equally has to deal with the possibility of having a route which is NULL passed in, and the fwd_tag in case a firewall changes the next hop (something to factor out in the future).
Sponsored by: Netflix Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D23886
show more ...
|
#
75dfc66c |
| 27-Feb-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r358269 through r358399.
|
#
3db60531 |
| 25-Feb-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
ip6_output: fix regression introduced in r358167 for ipv6 fragmentation
When moving the calculations for the optlen into the if (opt) block which deals with possible extension headers I failed to in
ip6_output: fix regression introduced in r358167 for ipv6 fragmentation
When moving the calculations for the optlen into the if (opt) block which deals with possible extension headers I failed to initialise unfragpartlen to the ipv6 header length if there were no extension headers present. Correct that mistake to make IPv6 fragment length calculcations work again.
Reported by: hselasky, kp OKed by: hselasky, kp MFC after: 3 days X-MFC with: r358167 PR: 244393
show more ...
|
#
3459050c |
| 24-Feb-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Fix IPv6 checksums when exthdrs are present.
In two places in ip6_output we are doing (delayed) checksum calculations. The initial logic came from SCTP in r205075,205104 and later I copied and adjus
Fix IPv6 checksums when exthdrs are present.
In two places in ip6_output we are doing (delayed) checksum calculations. The initial logic came from SCTP in r205075,205104 and later I copied and adjusted it for the TCP|UDP case in r235958. The problem was that the original SCTP offsets were already wrong for any case with extension headers present given IPv6 extension headers are not part of the pseudo checksum calculations. The later changes do not help in case there is checksum offloading as for extension headers (incl. fragments) we do currrently never offload as we have no infrastructure to know whether the NIC can handle these cases.
Correct the offsets for delayed checksum calculations and properly handle mbuf flags. In addition harmonize the almost identical duplicate code.
While here eliminate the now unneeded variable hlen and add an always missing mtod() call in the 1-b and 3 cases after the introduction of the mb_unmapped_to_ext() calls.
Reported by: Francis Dupont (fdupont isc.org) PR: 243675 MFC after: 6 days Reviewed by: markj (earlier version), gallatin Differential Revision: https://reviews.freebsd.org/D23760
show more ...
|
#
6c140a72 |
| 20-Feb-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r358131 through r358178.
|
#
a1a6c01e |
| 20-Feb-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
ip6_output: improve extension header handling
Move IPv6 source address checks from after extension header heandling to the top of the function. If we do not pass these checks there is no reason to d
ip6_output: improve extension header handling
Move IPv6 source address checks from after extension header heandling to the top of the function. If we do not pass these checks there is no reason to do a lot of work upfront.
Fold extension header preparations and length calculations together into a single branch and macro rather than doing them sequentially. Likewise move extension header concatination into a single branch block only doing it if we recorded any extension header length length.
Reviewed by: melifaro (earlier version), markj, gallatin Sponsored by: Netflix (partially, originally) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23740
show more ...
|
#
abaad9d7 |
| 18-Feb-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r358049 through r358074.
|
#
7c1daefe |
| 18-Feb-2020 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
ip6_output: update comments.
Clear up some comments and improve to panic messages.
No functional changes.
MFC after: 3 days
|
#
051669e8 |
| 25-Jan-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r356931 through r357118.
|
#
b9555453 |
| 22-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make ip6_output() and ip_output() require network epoch.
All callers that before may called into these functions without network epoch now must enter it.
|
#
e00ee1a9 |
| 01-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCESS if a firewall forbids sending.
Noticed by: ae
|
Revision tags: release/12.1.0 |
|
#
1e4f4e56 |
| 09-Oct-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
ip6_output() has a complex set of gotos, and some can jump out of the epoch section towards return statement. Since entering epoch is cheap, it is easier to cover the whole function with epoch, rathe
ip6_output() has a complex set of gotos, and some can jump out of the epoch section towards return statement. Since entering epoch is cheap, it is easier to cover the whole function with epoch, rather than try to properly maintain its state.
show more ...
|
#
8b3bc70a |
| 08-Oct-2019 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r352764 through r353315.
|
#
cb49ec54 |
| 08-Oct-2019 |
Mark Johnston <markj@FreeBSD.org> |
Improve locking in the IPV6_V6ONLY socket option handler.
Acquire the inp lock before checking whether the socket is already bound, and around updates to the inp_vflag field.
MFC after: 1 week Spon
Improve locking in the IPV6_V6ONLY socket option handler.
Acquire the inp lock before checking whether the socket is already bound, and around updates to the inp_vflag field.
MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21867
show more ...
|
#
b8a6e03f |
| 08-Oct-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex cover
Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier.
On output path we already enter epoch quite early - in the ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111
show more ...
|
#
c5c3ba6b |
| 03-Sep-2019 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r351317 through r351731.
|
#
b2e60773 |
| 27-Aug-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for tr
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.)
KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
show more ...
|
#
0ecd976e |
| 02-Aug-2019 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaini
IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining with causing too much disturbance.
Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD.
* Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6.
The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will less ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing.
No functional changes.
Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix
show more ...
|
#
a63915c2 |
| 28-Jul-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @r350386
Sponsored by: The FreeBSD Foundation
|
Revision tags: release/11.3.0 |
|
#
82334850 |
| 29-Jun-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required.
If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum.
Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616
show more ...
|
#
e532a999 |
| 20-Jun-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @349234
Sponsored by: The FreeBSD Foundation
|
#
77a01441 |
| 12-Jun-2019 |
John Baldwin <jhb@FreeBSD.org> |
Sort opt_foo.h #includes and add a missing blank line in ip_output().
|
#
0269ae4c |
| 06-Jun-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @348740
Sponsored by: The FreeBSD Foundation
|
#
fb3bc596 |
| 25-May-2019 |
John Baldwin <jhb@FreeBSD.org> |
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), i
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds.
To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN.
- To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag.
- mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead.
- Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117
show more ...
|
#
7648bc9f |
| 13-May-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @347527
Sponsored by: The FreeBSD Foundation
|