#
06bf119f |
| 28-Jan-2025 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets/tcp: quick fix for regression with SO_REUSEPORT_LB
There was a long living problem that pr_listen is called every time on consecutive listen(2) syscalls. Up until today it produces spurious
sockets/tcp: quick fix for regression with SO_REUSEPORT_LB
There was a long living problem that pr_listen is called every time on consecutive listen(2) syscalls. Up until today it produces spurious TCP state change events in tracing software and other harmless problems. But with 7cbb6b6e28db we started to call LIST_REMOVE() twice on the same entry.
This is quite ugly, but quick and robust fix against regression, that we decided to put in the scope of the January stabilization week. A better refactoring will happen later.
Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D48703 Fixes: 7cbb6b6e28db33095a1cf7a8887921a5ec969824
show more ...
|
#
7cbb6b6e |
| 23-Jan-2025 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Close some SO_REUSEPORT_LB races, part 2
Suppose a thread is adds a socket to an existing TCP lbgroup that is actively accepting connections. It has to do the following operations: 1. set SO
inpcb: Close some SO_REUSEPORT_LB races, part 2
Suppose a thread is adds a socket to an existing TCP lbgroup that is actively accepting connections. It has to do the following operations: 1. set SO_REUSEPORT_LB on the socket 2. bind() the socket to the shared address/port 3. call listen()
Step 2 makes the inpcb visible to incoming connection requests. However, at this point the inpcb cannot accept new connections. If in_pcblookup() matches it, the remote end will see ECONNREFUSED even when other listening sockets are present in the lbgroup. This means that dynamically adding inpcbs to an lbgroup (e.g., by starting up new workers) can trigger spurious connection failures for no good reason. (A similar problem exists when removing inpcbs from an lbgroup, but that is harder to fix and is not addressed by this patch; see the review for a bit more commentary.)
Fix this by augmenting each lbgroup with a linked list of inpcbs that are pending a listen() call. When adding an inpcb to an lbgroup, keep the inpcb on this list if listen() hasn't been called, so it is not yet visible to the lookup path. Then, add a new in_pcblisten() routine which makes the inpcb visible within the lbgroup now that it's safe to let it handle new connections.
Add a regression test which verifies that we don't get spurious connection errors while adding sockets to an LB group.
Reviewed by: glebius MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48544
show more ...
|
#
053a9884 |
| 23-Dec-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: don't ever return ECONNRESET on close(2)
The SUS doesn't mention this error code as a possible one [1]. The FreeBSD manual page specifies a possible ECONNRESET for close(2):
[ECONNRESET] The u
tcp: don't ever return ECONNRESET on close(2)
The SUS doesn't mention this error code as a possible one [1]. The FreeBSD manual page specifies a possible ECONNRESET for close(2):
[ECONNRESET] The underlying object was a stream socket that was shut down by the peer before all pending data was delivered.
In the past it had been EINVAL (see 21367f630d72), and this EINVAL was added as a safety measure in 623dce13c64ef. After conversion to ECONNRESET it had been documented in the manual page in 78e3a7fdd51e6, but I bet wasn't ever tested to actually be ever returned, cause the tcp-testsuite[2] didn't exist back then. So documentation is incorrect since 2006, if my bet wins. Anyway, in the modern FreeBSD the condition described above doesn't end up with ECONNRESET error code from close(2). The error condition is reported via SO_ERROR socket option, though. This can be checked using the tcp-testsuite, temporarily disabling the getsockopt(SO_ERROR) lines using sed command [3]. Most of these getsockopt(2)s are followed by '+0.00 close(3) = 0', which will confirm that close(2) doesn't return ECONNRESET even on a socket that has the error stored, neither it is returned in the case described in the manual page. The latter case is covered by multiple tests residing in tcp- testsuite/state-event-engine/rcv-rst-*.
However, the deleted block of code could be entered in a race condition between close(2) and processing of incoming packet, when connection had already been half-closed with shutdown(SHUT_WR) and sits in TCPS_LAST_ACK. This was reported in the bug 146845. With the block deleted, we will continue into tcp_disconnect() which has proper handling of INP_DROPPED.
The race explanation follows. The connection is in TCPS_LAST_ACK. The network input thread acquires the tcpcb lock first, sets INP_DROPPED, acquires the socket lock in soisdisconnected() and clears SS_ISCONNECTED. Meanwhile, the syscall thread goes through sodisconnect() which checks for SS_ISCONNECTED locklessly(!). The check passes and the thread blocks on the tcpcb lock in tcp_usr_disconnect(). Once input thread releases the lock, the syscall thread observes INP_DROPPED and returns ECONNRESET.
- Thread 1: tcp_do_segment()->tcp_close()->in_pcbdrop(),soisdisconnected() - Thread 2: sys_close()...->soclose()->sodisconnect()->tcp_usr_disconnect()
Note that the lockless operation in sodisconnect() isn't correct, but enforcing the socket lock there will not fix the problem.
[1] https://pubs.opengroup.org/onlinepubs/9799919799/ [2] https://github.com/freebsd-net/tcp-testsuite [3] sed -i "" -Ee '/\+0\.00 getsockopt\(3, SOL_SOCKET, SO_ERROR, \[ECONNRESET\]/d' $(grep -lr ECONNRESET tcp-testsuite)
PR: 146845 Reviewed by: tuexen, rrs, imp Differential Revision: https://reviews.freebsd.org/D48148
show more ...
|
#
c91dd7a0 |
| 19-Dec-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove unused variable from tcp_usr_disconnect()
|
Revision tags: release/14.2.0 |
|
#
0b4539ee |
| 14-Nov-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: gc unused argument of in_pcbconnect()
|
#
dded4e9e |
| 14-Nov-2024 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: change SOCKBUF_* macros to SOCK_[RECV|SEND]BUF_* macros
Change the older LOCK related macros over to the dedicated send/recv buffer macros in the base tcp stack.
No functional change intended.
tcp: change SOCKBUF_* macros to SOCK_[RECV|SEND]BUF_* macros
Change the older LOCK related macros over to the dedicated send/recv buffer macros in the base tcp stack.
No functional change intended.
Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D47567
show more ...
|
Revision tags: release/13.4.0 |
|
#
093d9b46 |
| 05-Aug-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
ddb: update printing of t_flags and tflags2
Update the ddb printing of t_flags and t_flags2 to the current state of definitions in tcp_var.h.
Reviewed by: cc MFC after: 1 week Sponsored by: Netf
ddb: update printing of t_flags and tflags2
Update the ddb printing of t_flags and t_flags2 to the current state of definitions in tcp_var.h.
Reviewed by: cc MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D46222
show more ...
|
#
00d3b744 |
| 28-Jul-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp cc: remove non-working sctp support
As suggested by lstewart, remove the non-working SCTP support in the TCP congestion control modules. SCTP has a similar functionality (although not using kern
tcp cc: remove non-working sctp support
As suggested by lstewart, remove the non-working SCTP support in the TCP congestion control modules. SCTP has a similar functionality (although not using kernel loadable modules), on which the TCP stuff was built on, but the integration was never done. No functional change intended.
Reviewed by: Peter Lei, cc Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D46142
show more ...
|
#
baee801c |
| 21-Jul-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: simplify endpoint creation at the passive side
Use the intended TCP stack when creating a TCP endpoint instead of creating it the endpoint the default stack first and after that switching it to
tcp: simplify endpoint creation at the passive side
Use the intended TCP stack when creating a TCP endpoint instead of creating it the endpoint the default stack first and after that switching it to use the intended TCP stack. Reviewed by: Peter Lei, rrs and jtl (older version) Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45411
show more ...
|
#
86c9325d |
| 06-Jun-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: simplify stack switching protocol
Before this patch, a stack (tfb) accepts a tcpcb (tp), if the tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL and tfb->tfb_tcp_handoff_ok(tp)
tcp: simplify stack switching protocol
Before this patch, a stack (tfb) accepts a tcpcb (tp), if the tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL and tfb->tfb_tcp_handoff_ok(tp) returns 0. After this patch, the only check is tfb->tfb_tcp_handoff_ok(tp) returns 0. tfb->tfb_tcp_handoff_ok must always be provided. For existing TCP stacks (FreeBSD, RACK and BBR) there is no functional change. However, the logic is simpler.
Reviewed by: lstewart, peter_lei_ieee_.org, rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45253
show more ...
|
Revision tags: release/14.1.0 |
|
#
e7381521 |
| 30-May-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: remove unused code in tcp_usr_attach
pr_attach is only called on a socket (so) with so->so_listen != NULL via sonewconn. However, sonewconn is not called from the TCP code. The listening socket
tcp: remove unused code in tcp_usr_attach
pr_attach is only called on a socket (so) with so->so_listen != NULL via sonewconn. However, sonewconn is not called from the TCP code. The listening sockets are handled in tcp_syncache.c without using sonewconn. Therefore, the code removed is never executed. No functional change intended.
Reviewed by: rrs, peter.lei_ieee.org MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45412
show more ...
|
#
fe136aec |
| 23-May-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: improve inp locking in setsockopt
Ensure that the inp is not dropped when starting a stack switch. While there, clean-up the code by using INP_WLOCK_RECHECK, which also re-assigns tp.
Reviewed
tcp: improve inp locking in setsockopt
Ensure that the inp is not dropped when starting a stack switch. While there, clean-up the code by using INP_WLOCK_RECHECK, which also re-assigns tp.
Reviewed by: glebius MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45241
show more ...
|
#
fce03f85 |
| 05-May-2024 |
Randall Stewart <rrs@FreeBSD.org> |
TCP can be subject to Sack Attacks lets fix this issue.
There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if i
TCP can be subject to Sack Attacks lets fix this issue.
There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:
ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends
ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.
To combat this we introduce three things.
when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway. The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).
After this set of changes is in rack can drop its SAD detection completely
Reviewed by:tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
show more ...
|
#
dd7b86e2 |
| 18-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove IS_FASTOPEN() macro
The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro.
A bigger problem was that
tcp: remove IS_FASTOPEN() macro
The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro.
A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers.
Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
show more ...
|
#
85df11a1 |
| 13-Mar-2024 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
ktls: deep copy tls_enable struct for in-kernel tcp consumers
Doing a deep copy of the keys early allows users of the tls_enable structure to assume kernel memory. This enables the socket options to
ktls: deep copy tls_enable struct for in-kernel tcp consumers
Doing a deep copy of the keys early allows users of the tls_enable structure to assume kernel memory. This enables the socket options to be set by kernel threads.
Reviewed By: #transport, tuexen, jhb, rrs Sponsored by: NetApp, Inc. X-NetApp-PR: #79 Differential Revision: https://reviews.freebsd.org/D44250
show more ...
|
#
e18b97bd |
| 12-Mar-2024 |
Randall Stewart <rrs@FreeBSD.org> |
Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal w
Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output.
It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements.
With a bit of a struggle with git I finally got rack_pcm.c into place (apologies for not noticing this error). The LINT kernel is running on my box now .. sigh.
Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
show more ...
|
#
c112243f |
| 11-Mar-2024 |
Brooks Davis <brooks@FreeBSD.org> |
Revert "Update to bring the rack stack with all its fixes in."
This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours.
This reverts commit f6d489f402c320f1a6eaa4
Revert "Update to bring the rack stack with all its fixes in."
This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours.
This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.
show more ...
|
#
f6d489f4 |
| 11-Mar-2024 |
Randall Stewart <rrs@FreeBSD.org> |
Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal w
Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output.
Note there is a new file that I can't figure out how to get in rack_pcm.c
It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements.
Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
show more ...
|
Revision tags: release/13.3.0 |
|
#
abe8379b |
| 15-Feb-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: repair wakeup of accept(2) by shutdown(2)
That was lost in transition from one-for-all soshutdown() to protocol specific methods. Only protocols that listen(2) were affected. This is not
sockets: repair wakeup of accept(2) by shutdown(2)
That was lost in transition from one-for-all soshutdown() to protocol specific methods. Only protocols that listen(2) were affected. This is not a documented or specified feature, but some software relies on it. At least the FreeSWITCH telephony software uses this behavior on PF_INET/SOCK_STREAM.
Fixes: 5bba2728079ed4da33f727dbc2b6ae1de02ba897
show more ...
|
#
3eeb22cb |
| 10-Feb-2024 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
tcp: clean scoreboard when releasing the socket buffer
The SACK scoreboard is conceptually an extention of the socket buffer. Remove it when the socket buffer goes away with soisdisconnected(). Veri
tcp: clean scoreboard when releasing the socket buffer
The SACK scoreboard is conceptually an extention of the socket buffer. Remove it when the socket buffer goes away with soisdisconnected(). Verify that this is also the expected state in tcp_discardcb().
PR: 276761 Reviewed by: glebius, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43805
show more ...
|
#
ce69e373 |
| 03-Feb-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Revert "sockets: retire sorflush()"
Provide a comment in sorflush() why the socket I/O sx(9) lock is actually important.
This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.
|
#
507f87a7 |
| 16-Jan-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: retire sorflush()
With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols th
sockets: retire sorflush()
With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols that use generic socket buffers.
The socket I/O sx(9) lock acquisition in sorflush() is not relevant for shutdown(2) operation as it doesn't do any I/O that may interleave with read(2) or write(2). The socket buffer mutex acquisition inside sbrelease() is what guarantees thread safety. This sx(9) acquisition in soshutdown() can be tracked down to 4.4BSD times, where it used to be sblock(), and it was carried over through the years evolving together with sockets with no reconsideration of why do we carry it over. I can't tell if that sblock() made sense back then, but it doesn't make any today.
Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43415
show more ...
|
#
5bba2728 |
| 16-Jan-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: make pr_shutdown fully protocol specific method
Disassemble a one-for-all soshutdown() into protocol specific methods. This creates a small amount of copy & paste, but makes code a lot more
sockets: make pr_shutdown fully protocol specific method
Disassemble a one-for-all soshutdown() into protocol specific methods. This creates a small amount of copy & paste, but makes code a lot more self documented, as protocol specific method would execute only the code that is relevant to that protocol and nothing else. This also fixes a couple recent regressions and reduces risk of future regressions. The extended KPI for the new pr_shutdown removes need for the extra pr_flush which was added for the sake of SCTP which could not perform its shutdown properly with the old one. Particularly for SCTP this change streamlines a lot of code.
Some notes on why certain parts of code were copied or were not to certain protocols: * The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is needed only for those protocols that may be connected or disconnected. * The above reduces into only SS_ISCONNECTED for those protocols that always connect instantly. * The ENOTCONN and continue processing hack is left only for datagram protocols. * The SOLISTENING(so) block is copied to those protocols that listen(2). * sorflush() on SHUT_RD is copied almost to every protocol, but that will be refactored later. * wakeup(&so->so_timeo) is copied to protocols that can make a non-instant connect(2), can SO_LINGER or can accept(2).
There are three protocols (netgraph(4), Bluetooth, SDP) that did not have pr_shutdown, but old soshutdown() would still perform sorflush() on SHUT_RD for them and also wakeup(9). Those protocols partially supported shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay about that and SDP is almost abandoned anyway.
Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43413
show more ...
|
#
a13039e2 |
| 27-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: reoder inpcb destruction
First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we com
inpcb: reoder inpcb destruction
First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we commit any destructive actions on it. There are couple functions that rely on the hash lock skipping SMR + inpcb lock to lookup an inpcb. Although there are no known functions that similarly rely on the global inpcb list lock, also do list removal before destructive actions.
PR: 273890 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D43122
show more ...
|
#
d2ef52ef |
| 04-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp/hpts: make stacks responsible for clearing themselves out HPTS
There already is the tfb_tcp_timer_stop_all method that is supposed to stop all time events associated with a given tcpcb by given
tcp/hpts: make stacks responsible for clearing themselves out HPTS
There already is the tfb_tcp_timer_stop_all method that is supposed to stop all time events associated with a given tcpcb by given stack. Some time ago it was doing actual callout_stop(). Today bbr/rack just mark their internal state as inactive in their tfb_tcp_timer_stop_all methods, but tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the methods to also call tcp_hpts_remove(). Note: I'm not sure if internal flag is still relevant once we are out of HPTS wheel.
Call the method when connection goes into TCP_CLOSED state, instead of calling it later when tcpcb is freed. Also call it when we switch between stacks.
Reviewed by: tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42857
show more ...
|