#
a034518a |
| 19-Dec-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
To serve web traffic efficiently on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This leads to cross-domain traffic when the server writes to the socket or calls sendfile(): memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, it will likely be reading memory allocated on the NUMA domain associated with the TCP connection.
This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (If no server sharing the listen socket is on a given domain, traffic falls back to being hashed as normal across all servers sharing the listen socket, regardless of domain.) This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases.
This patch, together with a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA, allows us to serve 190Gb/s of kTLS-encrypted https media content from dual-socket Xeons with only 13% cross-domain traffic on the memory controller (as measured by pcm.x).
Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21636
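As a rough illustration of how a worker might request the filtering described above, the sketch below sets TCP_REUSPORT_LB_NUMA on a listen socket. It assumes the option is applied with setsockopt() at the IPPROTO_TCP level to a socket that was bound with SO_REUSEPORT_LB and is already listening, and that the value is an integer NUMA domain. The helper name listen_set_numa_domain() is hypothetical, and the fallback branch exists only so the sketch compiles on systems that lack the option.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: ask the kernel to deliver to this listen socket
 * only connections associated with NUMA domain `domain`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
listen_set_numa_domain(int fd, int domain)
{
#ifdef TCP_REUSPORT_LB_NUMA
	/* Assumption: fd is a listening socket in an SO_REUSEPORT_LB group. */
	return (setsockopt(fd, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA,
	    &domain, sizeof(domain)));
#else
	/* Option not available on this system. */
	(void)fd;
	(void)domain;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A worker pinned to domain 0 would call listen_set_numa_domain(fd, 0) once after listen() and treat a failure as "no filtering", since the kernel then hashes connections across all workers as before.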
|
#
98d7a8d9 |
| 29-Oct-2020 |
John Baldwin <jhb@FreeBSD.org> |
Call m_snd_tag_rele() to free send tags.
Send tags are refcounted and if_snd_tag_free() is called by m_snd_tag_rele() when the last reference is dropped on a send tag.
Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26995
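The release-on-last-reference pattern described in this commit can be sketched in userland as follows. This is a toy model only: struct toy_snd_tag and the toy_tag_*() functions are hypothetical stand-ins, not the kernel's send tag structure or its m_snd_tag_rele(), and the counter is not atomic as a real kernel refcount would need to be.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a send tag; not the kernel structure. */
struct toy_snd_tag {
	unsigned refcount;
	/* Plays the role that if_snd_tag_free() has in the commit text. */
	void (*free_cb)(struct toy_snd_tag *);
};

static int toy_tags_freed;	/* counts invocations of the free callback */

static void
toy_tag_free(struct toy_snd_tag *t)
{
	toy_tags_freed++;
	free(t);
}

/* Allocate a tag holding one reference for the caller. */
struct toy_snd_tag *
toy_tag_alloc(void)
{
	struct toy_snd_tag *t = malloc(sizeof(*t));

	if (t == NULL)
		return (NULL);
	t->refcount = 1;
	t->free_cb = toy_tag_free;
	return (t);
}

/* Take an additional reference. */
void
toy_tag_ref(struct toy_snd_tag *t)
{
	assert(t->refcount > 0);
	t->refcount++;
}

/* Drop a reference; the free callback runs only for the last one. */
void
toy_tag_rele(struct toy_snd_tag *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		t->free_cb(t);
}
```

The point of routing every release through one function is that callers never invoke the free routine directly, so a tag still referenced elsewhere cannot be freed early.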
|
Revision tags: release/12.2.0 |
|
#
868aabb4 |
| 09-Oct-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IPPROTO_IP / IPPROTO_IPV6 setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a priority will insert a special 802.1Q VLAN header with VLAN ID = 0 to carry the priority setting.
Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26409
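A per-socket priority could be requested along these lines. This is a minimal sketch, assuming the options are named IP_VLAN_PCP and IPV6_VLAN_PCP, are set at the IPPROTO_IP and IPPROTO_IPV6 levels respectively, and take an int value. The helper set_vlan_pcp() is hypothetical, and the fallback branch only keeps the sketch compilable where the options are not defined.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: set the 802.1Q priority for one socket.
 * pcp is -1 (interface default) or a priority from 0 to 7.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
set_vlan_pcp(int fd, int is_ipv6, int pcp)
{
	/* Reject values outside the range given in the commit message. */
	if (pcp < -1 || pcp > 7) {
		errno = EINVAL;
		return (-1);
	}
#if defined(IP_VLAN_PCP) && defined(IPV6_VLAN_PCP)
	if (is_ipv6)
		return (setsockopt(fd, IPPROTO_IPV6, IPV6_VLAN_PCP,
		    &pcp, sizeof(pcp)));
	return (setsockopt(fd, IPPROTO_IP, IP_VLAN_PCP, &pcp, sizeof(pcp)));
#else
	/* Options not available on this system. */
	(void)fd;
	(void)is_ipv6;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

Checking the range in userland first gives a predictable EINVAL regardless of how a particular kernel reports out-of-range values.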
|
Revision tags: release/11.4.0 |
|
#
25102351 |
| 19-May-2020 |
Mike Karels <karels@FreeBSD.org> |
Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding.
Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781
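The uniqueness rule described above can be modelled with a small table lookup. The sketch below is a toy model, not the kernel's port selection code: struct toy_conn, toy_port_usable() and toy_pick_lport() are hypothetical, addresses are plain integers with 0 meaning "any", and ports are tried in ascending order, which is an assumption made only to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy connection entry; faddr == 0 and fport == 0 marks a wildcard binding. */
struct toy_conn {
	uint32_t laddr;
	uint16_t lport;
	uint32_t faddr;
	uint16_t fport;
};

/* A local port is usable if the full tuple is new and no wildcard binding owns it. */
static bool
toy_port_usable(const struct toy_conn *tbl, size_t n, const struct toy_conn *c)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_conn *e = &tbl[i];

		if (e->lport != c->lport)
			continue;
		/* Wildcard binding on this port and on a matching local address. */
		if (e->faddr == 0 && e->fport == 0 &&
		    (e->laddr == 0 || e->laddr == c->laddr))
			return (false);
		/* The exact tuple is already in use. */
		if (e->laddr == c->laddr && e->faddr == c->faddr &&
		    e->fport == c->fport)
			return (false);
	}
	return (true);
}

/* Choose a local port after the destination is known; 0 means none free. */
uint16_t
toy_pick_lport(const struct toy_conn *tbl, size_t n, uint32_t laddr,
    uint32_t faddr, uint16_t fport, uint16_t first, uint16_t last)
{
	for (uint32_t p = first; p <= last; p++) {
		struct toy_conn c = { laddr, (uint16_t)p, faddr, fport };

		if (toy_port_usable(tbl, n, &c))
			return ((uint16_t)p);
	}
	return (0);
}
```

In this model a port already used towards one destination is handed out again for a different destination, which is the behaviour the commit introduces; under the previous scheme the port alone had to be unique.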
|
#
44e86fbd |
| 13-Feb-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r357662 through r357854.
|
#
481be5de |
| 12-Feb-2020 |
Randall Stewart <rrs@FreeBSD.org> |
White space cleanup -- remove trailing tabs or spaces from any line.
Sponsored by: Netflix Inc.
|
#
fae994f6 |
| 15-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Stop header pollution and don't include if_var.h via in_pcb.h.
|
#
fe1274ee |
| 12-Jan-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp, a two-step approach was taken:
1) The local address was filled in, then the inp was added to the hash table.
2) The remote address was filled in and the inp was relocated in the hash table.
Before the epoch changes, a write lock was held while this happened and the code looking up entries held a corresponding read lock. Since the read lock went away with the introduction of the epochs, the half-populated inp could be found during lookup. This resulted in processing TCP segments in the context of the wrong TCP connection. This patch changes the above procedure so that the inp is fully populated before it is inserted into the hash table.
Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@ mailing list and for testing the patch!
Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22971
|
#
273b2e4c |
| 07-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove now unused INP_INFO_RLOCK macros.
|
#
5015a05f |
| 07-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove now unused INP_HASH_RLOCK() macros.
|
#
5a126433 |
| 07-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Add INP_UNLOCK() which will do whatever R/W unlock is required.
|
Revision tags: release/12.1.0 |
|
#
0ecd976e |
| 02-Aug-2019 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now, so it is feasible to make the change for everything remaining without causing too much disturbance.
Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD.
* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6.
The reasons behind this cleanup are that we are trying to further unify netinet and netinet6 code where possible, and that people will be less likely to overlook one or the other protocol family when making code changes, as different names for the same thing can hide places that need changing.
No functional changes.
Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix
|
#
20abea66 |
| 01-Aug-2019 |
Randall Stewart <rrs@FreeBSD.org> |
This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit API in this commit.
Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953
|
#
a63915c2 |
| 28-Jul-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @r350386
Sponsored by: The FreeBSD Foundation
|
#
3b0b41e6 |
| 10-Jul-2019 |
Randall Stewart <rrs@FreeBSD.org> |
This commit updates rack to what is basically being used at NF and lays some of the groundwork for committing BBR. The hpts system is updated, as are some other utilities needed for the entrance of BBR. This is actually part 1 of 3 more needed commits, which will finally complete with BBRv1 being added as a new TCP stack.
Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834
|
Revision tags: release/11.3.0 |
|
#
7648bc9f |
| 13-May-2019 |
Alan Somers <asomers@FreeBSD.org> |
MFHead @347527
Sponsored by: The FreeBSD Foundation
|
#
50575ce1 |
| 25-Apr-2019 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Track TCP connection's NUMA domain in the inpcb
Track TCP connection's NUMA domain in the inpcb
Drivers can now pass up NUMA domain information via the mbuf NUMA domain field. This information is then used by TCP syncache_socket() to associate that information with the inpcb. The domain information is then fed back into transmitted mbufs in ip{6}_output(). This mechanism is nearly identical to what is done to track RSS hash values in the inp_flowid.
Follow-on changes will use this information for LACP egress port selection, binding TCP pacers to the appropriate NUMA domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20028
|
#
a68cc388 |
| 09-Jan-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create an epoch_tracker on the thread stack. Such macros are quite unsafe, e.g. they will produce buggy code if the same macro is used in nested scopes. Always declare the epoch_tracker explicitly.
- Unmask the interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF-specific data IF_AFDATA_RLOCK() read locking macros to what they actually are: the net_epoch. Keeping them as is is very misleading. They are all named FOO_RLOCK(), while they no longer have lock semantics. They now allow recursion and, more importantly, they no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit.
This is a non-functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter the epoch recursively.
Discussed with: jtl, gallatin
|
#
67350cb5 |
| 09-Dec-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r340918 through r341763.
|
Revision tags: release/12.0.0 |
|
#
9d2877fc |
| 05-Dec-2018 |
Mark Johnston <markj@FreeBSD.org> |
Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.
Memory beyond that limit was previously unused, wasting roughly 1MB per 8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to INP_PCBPORTHASH.
Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17803
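The reason chains beyond IPPORT_MAX + 1 go unused is that a port hash indexed by a 16-bit port can only ever produce 65536 distinct values. The sketch below illustrates the clamp under that assumption. The toy_* names and TOY_IPPORT_MAX are hypothetical, and the power-of-two rounding stands in for whatever sizing the kernel's hash allocator performs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IPPORT_MAX	65535ul	/* largest 16-bit port number */

/* Round n (> 0) down to a power of two so that (chains - 1) is a mask. */
static unsigned long
toy_pow2_floor(unsigned long n)
{
	unsigned long p = 1;

	while (p <= n / 2)
		p *= 2;
	return (p);
}

/* Number of port hash chains to allocate for a requested table size. */
unsigned long
toy_porthash_chains(unsigned long requested)
{
	assert(requested > 0);
	/* More chains than there are ports could never be indexed. */
	if (requested > TOY_IPPORT_MAX + 1)
		requested = TOY_IPPORT_MAX + 1;
	return (toy_pow2_floor(requested));
}

/* Chain index for a port given in host byte order. */
unsigned long
toy_porthash_index(uint16_t lport, unsigned long chains)
{
	return (lport & (chains - 1));
}
```

With the clamp in place, a request for 2^20 chains yields 65536, and every chain can be reached by some port.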
|
#
c6879c6c |
| 23-Oct-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r339015 through r339669.
|
#
01d4e214 |
| 05-Oct-2018 |
Glen Barber <gjb@FreeBSD.org> |
MFH r338661 through r339200.
Sponsored by: The FreeBSD Foundation
|
#
384a5c3c |
| 01-Oct-2018 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic it is possible that the code is running in a net_epoch_preempt section, and INP_INFO_UNLOCK_ASSERT() is a very strict assertion for such a case.
PR: 231428 Reviewed by: mmacy, tuexen Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17335
|
#
3af64f03 |
| 11-Sep-2018 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r338392 through r338594.
|
#
54af3d0d |
| 10-Sep-2018 |
Mark Johnston <markj@FreeBSD.org> |
Fix synchronization of LB group access.
Lookups are protected by an epoch section, so the LB group linkage must be a CK_LIST rather than a plain LIST. Furthermore, we were not deferring LB group frees, so in_pcbremlbgrouphash() could race with readers and cause a use-after-free.
Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com> Tested by: gallatin Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17031
|