#
de156263 |
| 26-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Several IP level socket options may affect TCP.
After handling them in IP level ctloutput, pass them down to TCP ctloutput.
We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place for
Several IP level socket options may affect TCP.
After handling them in IP level ctloutput, pass them down to TCP ctloutput.
We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place for now, but comment out how it should be handled.
For IPv4 we are interested in IP_TOS and IP_TTL.
Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
show more ...
|
#
fc4d53cc |
| 26-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Split tcp_ctloutput() into set/get parts.
Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
|
#
e2833083 |
| 26-Oct-2021 |
Peter Lei <peterlei@netflix.com> |
tcp: socket option to get stack alias name
TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programatic sysctl a
tcp: socket option to get stack alias name
TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programatic sysctl access.
Obtained from: Netflix
show more ...
|
#
bf256782 |
| 17-Sep-2021 |
Mark Johnston <markj@FreeBSD.org> |
ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate o
ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate output parameter. Convert to the new socket buffer locking macros while here.
Note that the socket buffer lock is not needed to synchronize the SOLISTENING check here, we can rely on the PCB lock.
Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31977
show more ...
|
#
bd4a39cc |
| 07-Sep-2021 |
Mark Johnston <markj@FreeBSD.org> |
socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the following:
SOCK_LOCK(so); error = solisten_proto_check(s
socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the following:
SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so);
solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy.
The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2).
This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it.
Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659
show more ...
|
#
3f1f6b6e |
| 05-Aug-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp, udp: improve input validation in handling bind()
Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com Reviewed
tcp, udp: improve input validation in handling bind()
Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com Reviewed by: markj, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31422
show more ...
|
#
4747500d |
| 04-Jun-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.
So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since
tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.
So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war.
I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however.
Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627
show more ...
|
#
f96603b5 |
| 01-Jun-2021 |
Mark Johnston <markj@FreeBSD.org> |
tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY
Prior to commit f161d294b we only checked the sockaddr length, but now we verify the address family as well. This breaks at leas
tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY
Prior to commit f161d294b we only checked the sockaddr length, but now we verify the address family as well. This breaks at least ttcp. Relax the check to avoid breaking compatibility too much: permit AF_UNSPEC if the address is INADDR_ANY.
Fixes: f161d294b Reported by: Bakul Shah <bakul@iitbombay.org> Reviewed by: tuexen MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30539
show more ...
|
#
086a3556 |
| 25-May-2021 |
Andrew Gallatin <gallatin@FreeBSD.org> |
tcp: enter network epoch when calling tfb_tcp_fb_fini
We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack
Differential
tcp: enter network epoch when calling tfb_tcp_fb_fini
We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack
Differential Revision: https://reviews.freebsd.org/D30407 Reviewed by: rrs, tuexen Sponsored by: Netflix
show more ...
|
#
13c0e198 |
| 25-May-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: Fix bugs related to the PUSH bit and rack and an ack war
Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the
tcp: Fix bugs related to the PUSH bit and rack and an ack war
Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted.
Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing.
Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451
show more ...
|
#
7d2608a5 |
| 21-May-2021 |
Mark Johnston <markj@FreeBSD.org> |
tcp: Make error handling in tcp_usr_send() more consistent
- Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buff
tcp: Make error handling in tcp_usr_send() more consistent
- Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buffer if an implicit connect fails. At that point the mbuf has already been enqueued but we don't want to keep it in the send buffer.
Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349
show more ...
|
#
d8acd268 |
| 12-May-2021 |
Mark Johnston <markj@FreeBSD.org> |
Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and
Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed.
This diff plugs various leaks in the pru_send implementations.
Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151
show more ...
|
#
0471a8c7 |
| 10-May-2021 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: SACK Lost Retransmission Detection (LRD)
Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1
Review
tcp: SACK Lost Retransmission Detection (LRD)
Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1
Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
show more ...
|
#
f161d294 |
| 03-May-2021 |
Mark Johnston <markj@FreeBSD.org> |
Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were valida
Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate.
Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519
show more ...
|
#
9e644c23 |
| 18-Apr-2021 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support
tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs.
Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
show more ...
|
Revision tags: release/13.0.0 |
|
#
4c0bef07 |
| 21-Jan-2021 |
Kyle Evans <kevans@FreeBSD.org> |
kern: net: remove TCP_LINGERTIME
TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be
kern: net: remove TCP_LINGERTIME
TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be the case that there was a single pr_usrreq method with requests dispatched to it; these exact two lines appeared in tcp_usrreq's PRU_ATTACH handling.
The only purpose of this that I can find is to cause surprising behavior on accepted connections. Newly-created sockets will never hit these paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is set on a listening socket and inherited, one would expect the timeout to be inherited rather than changed arbitrarily like this -- noting that SO_LINGER is nonsense on a listening socket beyond inheritance, since they cannot be 'connected' by definition.
Neither Illumos nor Linux reset the timer like this based on testing and inspection of Illumos, and testing of Linux.
Reviewed by: rscheff, tuexen Differential Revision: https://reviews.freebsd.org/D28265
show more ...
|
#
a034518a |
| 19-Dec-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
show more ...
|
Revision tags: release/12.2.0 |
|
#
662c1305 |
| 01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
net: clean up empty lines in .c and .h files
|
#
e2c0e292 |
| 16-Jul-2020 |
Glen Barber <gjb@FreeBSD.org> |
MFH
Sponsored by: Rubicon Communications, LLC (netgate.com)
|
#
f903a308 |
| 16-Jul-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
(Re)-allow 0.0.0.0 to be used as an address in connect() for TCP In r361752 an error handling was introduced for using 0.0.0.0 or 255.255.255.255 as the address in connect() for TCP, since both addre
(Re)-allow 0.0.0.0 to be used as an address in connect() for TCP In r361752 an error handling was introduced for using 0.0.0.0 or 255.255.255.255 as the address in connect() for TCP, since both addresses can't be used. However, the stack maps 0.0.0.0 implicitly to a local address and at least two regressions were reported. Therefore, re-allow the usage of 0.0.0.0. While there, change the error indicated when using 255.255.255.255 from EAFNOSUPPORT to EACCES as mentioned in the man-page of connect().
Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D25401
show more ...
|
Revision tags: release/11.4.0 |
|
#
e854dd38 |
| 08-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we h
An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded.
This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable.
Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
show more ...
|
#
2cf21ae5 |
| 03-Jun-2020 |
Randall Stewart <rrs@FreeBSD.org> |
We should never allow either the broadcast or IN_ADDR_ANY to be connected to or sent to. This was fond when working with Michael Tuexen and Skyzaller. Skyzaller seems to want to use either of these t
We should never allow either the broadcast or IN_ADDR_ANY to be connected to or sent to. This was fond when working with Michael Tuexen and Skyzaller. Skyzaller seems to want to use either of these two addresses to connect to at times. And it really is an error to do so, so lets not allow that behavior.
Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24852
show more ...
|
#
d442a657 |
| 03-Jun-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state
Enabling TCP-FASTOPEN on an end-point which is in a state other than CLOSED or LISTEN, is a bug in the application. So it shoul
Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state
Enabling TCP-FASTOPEN on an end-point which is in a state other than CLOSED or LISTEN, is a bug in the application. So it should not work. Also the TCP code does not (and needs not to) handle this. While there, also simplify the setting of the TF_FASTOPEN flag.
This issue was found by running syzkaller.
Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D25115
show more ...
|
#
25102351 |
| 19-May-2020 |
Mike Karels <karels@FreeBSD.org> |
Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections
Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding.
Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781
show more ...
|
#
e240ce42 |
| 15-May-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.
This problem was found by looking at syzkaller reproducers for some other problems.
Reviewed by: rrs Sponsored by: Netflix, Inc.
Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.
This problem was found by looking at syzkaller reproducers for some other problems.
Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24831
show more ...
|