History log of /freebsd/sys/netinet/tcp_usrreq.c (Results 1 – 25 of 745)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 06bf119f 28-Jan-2025 Gleb Smirnoff <glebius@FreeBSD.org>

sockets/tcp: quick fix for regression with SO_REUSEPORT_LB

There was a long living problem that pr_listen is called every time on
consecutive listen(2) syscalls. Up until today it produces spurious

sockets/tcp: quick fix for regression with SO_REUSEPORT_LB

There was a long living problem that pr_listen is called every time on
consecutive listen(2) syscalls. Up until today it produces spurious TCP
state change events in tracing software and other harmless problems. But
with 7cbb6b6e28db we started to call LIST_REMOVE() twice on the same
entry.

This is quite ugly, but quick and robust fix against regression, that we
decided to put in the scope of the January stabilization week. A better
refactoring will happen later.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D48703
Fixes: 7cbb6b6e28db33095a1cf7a8887921a5ec969824

show more ...


# 7cbb6b6e 23-Jan-2025 Mark Johnston <markj@FreeBSD.org>

inpcb: Close some SO_REUSEPORT_LB races, part 2

Suppose a thread is adds a socket to an existing TCP lbgroup that is
actively accepting connections. It has to do the following operations:
1. set SO

inpcb: Close some SO_REUSEPORT_LB races, part 2

Suppose a thread is adds a socket to an existing TCP lbgroup that is
actively accepting connections. It has to do the following operations:
1. set SO_REUSEPORT_LB on the socket
2. bind() the socket to the shared address/port
3. call listen()

Step 2 makes the inpcb visible to incoming connection requests.
However, at this point the inpcb cannot accept new connections. If
in_pcblookup() matches it, the remote end will see ECONNREFUSED even
when other listening sockets are present in the lbgroup. This means
that dynamically adding inpcbs to an lbgroup (e.g., by starting up new
workers) can trigger spurious connection failures for no good reason.
(A similar problem exists when removing inpcbs from an lbgroup, but that
is harder to fix and is not addressed by this patch; see the review for
a bit more commentary.)

Fix this by augmenting each lbgroup with a linked list of inpcbs that
are pending a listen() call. When adding an inpcb to an lbgroup, keep
the inpcb on this list if listen() hasn't been called, so it is not yet
visible to the lookup path. Then, add a new in_pcblisten() routine which
makes the inpcb visible within the lbgroup now that it's safe to let it
handle new connections.

Add a regression test which verifies that we don't get spurious
connection errors while adding sockets to an LB group.

Reviewed by: glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48544

show more ...


# 053a9884 23-Dec-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: don't ever return ECONNRESET on close(2)

The SUS doesn't mention this error code as a possible one [1]. The FreeBSD
manual page specifies a possible ECONNRESET for close(2):

[ECONNRESET] The u

tcp: don't ever return ECONNRESET on close(2)

The SUS doesn't mention this error code as a possible one [1]. The FreeBSD
manual page specifies a possible ECONNRESET for close(2):

[ECONNRESET] The underlying object was a stream socket that was
shut down by the peer before all pending data was
delivered.

In the past it had been EINVAL (see 21367f630d72), and this EINVAL was
added as a safety measure in 623dce13c64ef. After conversion to
ECONNRESET it had been documented in the manual page in 78e3a7fdd51e6, but
I bet wasn't ever tested to actually be ever returned, cause the
tcp-testsuite[2] didn't exist back then. So documentation is incorrect
since 2006, if my bet wins. Anyway, in the modern FreeBSD the condition
described above doesn't end up with ECONNRESET error code from close(2).
The error condition is reported via SO_ERROR socket option, though. This
can be checked using the tcp-testsuite, temporarily disabling the
getsockopt(SO_ERROR) lines using sed command [3]. Most of these
getsockopt(2)s are followed by '+0.00 close(3) = 0', which will confirm
that close(2) doesn't return ECONNRESET even on a socket that has the
error stored, neither it is returned in the case described in the manual
page. The latter case is covered by multiple tests residing in tcp-
testsuite/state-event-engine/rcv-rst-*.

However, the deleted block of code could be entered in a race condition
between close(2) and processing of incoming packet, when connection had
already been half-closed with shutdown(SHUT_WR) and sits in TCPS_LAST_ACK.
This was reported in the bug 146845. With the block deleted, we will
continue into tcp_disconnect() which has proper handling of INP_DROPPED.

The race explanation follows. The connection is in TCPS_LAST_ACK. The
network input thread acquires the tcpcb lock first, sets INP_DROPPED,
acquires the socket lock in soisdisconnected() and clears SS_ISCONNECTED.
Meanwhile, the syscall thread goes through sodisconnect() which checks for
SS_ISCONNECTED locklessly(!). The check passes and the thread blocks on
the tcpcb lock in tcp_usr_disconnect(). Once input thread releases the
lock, the syscall thread observes INP_DROPPED and returns ECONNRESET.

- Thread 1: tcp_do_segment()->tcp_close()->in_pcbdrop(),soisdisconnected()
- Thread 2: sys_close()...->soclose()->sodisconnect()->tcp_usr_disconnect()

Note that the lockless operation in sodisconnect() isn't correct, but
enforcing the socket lock there will not fix the problem.

[1] https://pubs.opengroup.org/onlinepubs/9799919799/
[2] https://github.com/freebsd-net/tcp-testsuite
[3] sed -i "" -Ee '/\+0\.00 getsockopt\(3, SOL_SOCKET, SO_ERROR, \[ECONNRESET\]/d' $(grep -lr ECONNRESET tcp-testsuite)

PR: 146845
Reviewed by: tuexen, rrs, imp
Differential Revision: https://reviews.freebsd.org/D48148

show more ...


# c91dd7a0 19-Dec-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove unused variable from tcp_usr_disconnect()


Revision tags: release/14.2.0
# 0b4539ee 14-Nov-2024 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: gc unused argument of in_pcbconnect()


# dded4e9e 14-Nov-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: change SOCKBUF_* macros to SOCK_[RECV|SEND]BUF_* macros

Change the older LOCK related macros over to the
dedicated send/recv buffer macros in the base tcp stack.

No functional change intended.

tcp: change SOCKBUF_* macros to SOCK_[RECV|SEND]BUF_* macros

Change the older LOCK related macros over to the
dedicated send/recv buffer macros in the base tcp stack.

No functional change intended.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47567

show more ...


Revision tags: release/13.4.0
# 093d9b46 05-Aug-2024 Michael Tuexen <tuexen@FreeBSD.org>

ddb: update printing of t_flags and tflags2

Update the ddb printing of t_flags and t_flags2 to the current state of
definitions in tcp_var.h.

Reviewed by: cc
MFC after: 1 week
Sponsored by: Netf

ddb: update printing of t_flags and tflags2

Update the ddb printing of t_flags and t_flags2 to the current state of
definitions in tcp_var.h.

Reviewed by: cc
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D46222

show more ...


# 00d3b744 28-Jul-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp cc: remove non-working sctp support

As suggested by lstewart, remove the non-working SCTP support in the
TCP congestion control modules. SCTP has a similar functionality
(although not using kern

tcp cc: remove non-working sctp support

As suggested by lstewart, remove the non-working SCTP support in the
TCP congestion control modules. SCTP has a similar functionality
(although not using kernel loadable modules), on which the TCP stuff
was built on, but the integration was never done.
No functional change intended.

Reviewed by: Peter Lei, cc
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D46142

show more ...


# baee801c 21-Jul-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: simplify endpoint creation at the passive side

Use the intended TCP stack when creating a TCP endpoint instead of
creating it the endpoint the default stack first and after that
switching it to

tcp: simplify endpoint creation at the passive side

Use the intended TCP stack when creating a TCP endpoint instead of
creating it the endpoint the default stack first and after that
switching it to use the intended TCP stack.
Reviewed by: Peter Lei, rrs and jtl (older version)
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45411

show more ...


# 86c9325d 06-Jun-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: simplify stack switching protocol

Before this patch, a stack (tfb) accepts a tcpcb (tp), if the
tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL
and tfb->tfb_tcp_handoff_ok(tp)

tcp: simplify stack switching protocol

Before this patch, a stack (tfb) accepts a tcpcb (tp), if the
tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL
and tfb->tfb_tcp_handoff_ok(tp) returns 0.
After this patch, the only check is tfb->tfb_tcp_handoff_ok(tp)
returns 0. tfb->tfb_tcp_handoff_ok must always be provided.
For existing TCP stacks (FreeBSD, RACK and BBR) there is no
functional change. However, the logic is simpler.

Reviewed by: lstewart, peter_lei_ieee_.org, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45253

show more ...


Revision tags: release/14.1.0
# e7381521 30-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused code in tcp_usr_attach

pr_attach is only called on a socket (so) with so->so_listen != NULL
via sonewconn. However, sonewconn is not called from the TCP code.
The listening socket

tcp: remove unused code in tcp_usr_attach

pr_attach is only called on a socket (so) with so->so_listen != NULL
via sonewconn. However, sonewconn is not called from the TCP code.
The listening sockets are handled in tcp_syncache.c without using
sonewconn. Therefore, the code removed is never executed.
No functional change intended.

Reviewed by: rrs, peter.lei_ieee.org
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45412

show more ...


# fe136aec 23-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve inp locking in setsockopt

Ensure that the inp is not dropped when starting a stack switch.
While there, clean-up the code by using INP_WLOCK_RECHECK, which
also re-assigns tp.

Reviewed

tcp: improve inp locking in setsockopt

Ensure that the inp is not dropped when starting a stack switch.
While there, clean-up the code by using INP_WLOCK_RECHECK, which
also re-assigns tp.

Reviewed by: glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45241

show more ...


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if i

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>

show more ...


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362

show more ...


# 85df11a1 13-Mar-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

ktls: deep copy tls_enable struct for in-kernel tcp consumers

Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to

ktls: deep copy tls_enable struct for in-kernel tcp consumers

Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.

Reviewed By: #transport, tuexen, jhb, rrs
Sponsored by: NetApp, Inc.
X-NetApp-PR: #79
Differential Revision: https://reviews.freebsd.org/D44250

show more ...


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal w

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986

show more ...


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa4

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.

show more ...


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal w

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986

show more ...


Revision tags: release/13.3.0
# abe8379b 15-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: repair wakeup of accept(2) by shutdown(2)

That was lost in transition from one-for-all soshutdown() to protocol
specific methods. Only protocols that listen(2) were affected. This is
not

sockets: repair wakeup of accept(2) by shutdown(2)

That was lost in transition from one-for-all soshutdown() to protocol
specific methods. Only protocols that listen(2) were affected. This is
not a documented or specified feature, but some software relies on it. At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.

Fixes: 5bba2728079ed4da33f727dbc2b6ae1de02ba897

show more ...


# 3eeb22cb 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean scoreboard when releasing the socket buffer

The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Veri

tcp: clean scoreboard when releasing the socket buffer

The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805

show more ...


# ce69e373 03-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

Revert "sockets: retire sorflush()"

Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.


# 507f87a7 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: retire sorflush()

With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols th

sockets: retire sorflush()

With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2). The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety. This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over. I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43415

show more ...


# 5bba2728 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make pr_shutdown fully protocol specific method

Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more

sockets: make pr_shutdown fully protocol specific method

Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413

show more ...


# a13039e2 27-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: reoder inpcb destruction

First, merge in_pcbdetach() with in_pcbfree(). The comment for
in_pcbdetach() was no longer correct. Then, make sure we remove
the inpcb from the hash before we com

inpcb: reoder inpcb destruction

First, merge in_pcbdetach() with in_pcbfree(). The comment for
in_pcbdetach() was no longer correct. Then, make sure we remove
the inpcb from the hash before we commit any destructive actions
on it. There are couple functions that rely on the hash lock
skipping SMR + inpcb lock to lookup an inpcb. Although there are
no known functions that similarly rely on the global inpcb list
lock, also do list removal before destructive actions.

PR: 273890
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43122

show more ...


# d2ef52ef 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given stack. Some time
ago it was doing actual callout_stop(). Today bbr/rack just mark their
internal state as inactive in their tfb_tcp_timer_stop_all methods, but
tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the
methods to also call tcp_hpts_remove(). Note: I'm not sure if internal
flag is still relevant once we are out of HPTS wheel.

Call the method when connection goes into TCP_CLOSED state, instead of
calling it later when tcpcb is freed. Also call it when we switch between
stacks.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42857

show more ...


12345678910>>...30