#
7cbb6b6e |
| 23-Jan-2025 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Close some SO_REUSEPORT_LB races, part 2
Suppose a thread is adds a socket to an existing TCP lbgroup that is actively accepting connections. It has to do the following operations: 1. set SO
inpcb: Close some SO_REUSEPORT_LB races, part 2
Suppose a thread is adds a socket to an existing TCP lbgroup that is actively accepting connections. It has to do the following operations: 1. set SO_REUSEPORT_LB on the socket 2. bind() the socket to the shared address/port 3. call listen()
Step 2 makes the inpcb visible to incoming connection requests. However, at this point the inpcb cannot accept new connections. If in_pcblookup() matches it, the remote end will see ECONNREFUSED even when other listening sockets are present in the lbgroup. This means that dynamically adding inpcbs to an lbgroup (e.g., by starting up new workers) can trigger spurious connection failures for no good reason. (A similar problem exists when removing inpcbs from an lbgroup, but that is harder to fix and is not addressed by this patch; see the review for a bit more commentary.)
Fix this by augmenting each lbgroup with a linked list of inpcbs that are pending a listen() call. When adding an inpcb to an lbgroup, keep the inpcb on this list if listen() hasn't been called, so it is not yet visible to the lookup path. Then, add a new in_pcblisten() routine which makes the inpcb visible within the lbgroup now that it's safe to let it handle new connections.
Add a regression test which verifies that we don't get spurious connection errors while adding sockets to an LB group.
Reviewed by: glebius MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48544
show more ...
|
Revision tags: release/14.2.0 |
|
#
0b4539ee |
| 14-Nov-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: gc unused argument of in_pcbconnect()
|
Revision tags: release/13.4.0, release/14.1.0 |
|
#
1a8d1764 |
| 29-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: fully retire inp_ppcb pointer
Before a protocol specific control block started to embed inpcb in self (see 0aa120d52f3c, e68b3792440c, 483fe96511ec) this pointer used to point at it.
Retain
inpcb: fully retire inp_ppcb pointer
Before a protocol specific control block started to embed inpcb in self (see 0aa120d52f3c, e68b3792440c, 483fe96511ec) this pointer used to point at it.
Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The exp-run detected a minimal use of the field in ports: * sysutils/lsof - patched upstream * net-mgmt/netdata - patch accepted upstream * emulators/qemu-user-static - upstream master branch seems not using the field anymore We can keep the field around for some time, but eventually it may be reused for something else.
PR: 277659 (exp-run) Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D44491
show more ...
|
#
027fda80 |
| 18-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: remove unused KPIs to manipulate inpcbs
These KPIs were added in 9d29c635daa69 and through 15 years had zero use. They slightly remind what IfAPI does for struct ifnet. But IfAPI does that f
inpcb: remove unused KPIs to manipulate inpcbs
These KPIs were added in 9d29c635daa69 and through 15 years had zero use. They slightly remind what IfAPI does for struct ifnet. But IfAPI does that for the sake of large collection of NIC drivers not being aware of struct ifnet. For the inpcb it is unclear what could be a large collection of externally written kernel modules that need extensively use inpcb and not be aware of its internals at the same time. This isolation of a structure knowledge requires a lot of work, and just throwing in a few KPIs isn't helpful.
Reviewed by: kib, bz, markj Differential Revision: https://reviews.freebsd.org/D44310
show more ...
|
Revision tags: release/13.3.0 |
|
#
a13039e2 |
| 27-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: reoder inpcb destruction
First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we com
inpcb: reoder inpcb destruction
First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we commit any destructive actions on it. There are couple functions that rely on the hash lock skipping SMR + inpcb lock to lookup an inpcb. Although there are no known functions that similarly rely on the global inpcb list lock, also do list removal before destructive actions.
PR: 273890 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D43122
show more ...
|
#
0fac350c |
| 30-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: don't malloc/free sockaddr memory on getpeername/getsockname
Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses.
sockets: don't malloc/free sockaddr memory on getpeername/getsockname
Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses. Although, these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size.
Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods.
Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
show more ...
|
#
29363fb4 |
| 23-Nov-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove ancient SCCS tags.
Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl s
sys: Remove ancient SCCS tags.
Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script.
Sponsored by: Netflix
show more ...
|
#
bbbd7aab |
| 20-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: garbage collect in_pcbnotifyall()
|
Revision tags: release/14.0.0 |
|
#
2ff63af9 |
| 16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: one-line .h pattern
Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/
|
#
e3ba0d6a |
| 27-Jul-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: do not copy so_options into inp_flags2
Since f71cb9f74808 socket stays connnected with inpcb through latter's lifetime and there is no reason to complicate things and copy these flags.
Revie
inpcb: do not copy so_options into inp_flags2
Since f71cb9f74808 socket stays connnected with inpcb through latter's lifetime and there is no reason to complicate things and copy these flags.
Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D41198
show more ...
|
#
a43e7a96 |
| 27-Jul-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use internal flag to mark pcbs that are inserted into lbgroup
Using INP_REUSEPORT_LB is unsafe, as it is basically a copy of socket's SO_REUSEPORT_LB flag, which can be cleared by userland af
inpcb: use internal flag to mark pcbs that are inserted into lbgroup
Using INP_REUSEPORT_LB is unsafe, as it is basically a copy of socket's SO_REUSEPORT_LB flag, which can be cleared by userland after bind().
Reviewed by: markj Reported by: syzbot+e7d2e451f89fb444319b@syzkaller.appspotmail.com Differential Revision: https://reviews.freebsd.org/D41197
show more ...
|
#
c3c20de3 |
| 25-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: move HPTS/LRO flags out of inpcb to tcpcb
These flags are TCP specific. While here, make also several LRO internal functions to pass tcpcb pointer instead of inpcb one.
Reviewed by: rrs Diff
tcp: move HPTS/LRO flags out of inpcb to tcpcb
These flags are TCP specific. While here, make also several LRO internal functions to pass tcpcb pointer instead of inpcb one.
Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39698
show more ...
|
#
c2a69e84 |
| 25-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: move HPTS related fields from inpcb to tcpcb
This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME
tcp_hpts: move HPTS related fields from inpcb to tcpcb
This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring.
Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
show more ...
|
#
7b92493a |
| 20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed b
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed bound sockets. To do this it has to maintain a ucred reference, and for this to be safe, the reference can't be released until the UMA destructor is called, and this will not happen within any bounded time period.
Changing SMR to periodically recycle garbage is not trivial. Instead, let's implement SMR-synchronized lookup without needing to dereference inp_cred. This will allow the inpcb code to free the inp_cred reference immediately when a PCB is freed, ensuring that ucred (and thus jail) references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup") gets us part of the way there. This patch goes further to handle lookups of unconnected sockets. Here, the strategy is to maintain a well-defined order of items within a hash chain so that a wild lookup can simply return the first match and preserve existing semantics. This makes insertion of listening sockets more complicated in order to make lookup simpler, which seems like the right tradeoff anyway given that bind() is already a fairly expensive operation and lookups are more common.
In particular, when inserting an unconnected socket, in_pcbinhash() now keeps the following ordering: - jailed sockets before non-jailed sockets, - specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash lookups. When a match is found, we try to lock the inpcb and re-validate its connection info. In the common case, this works well and we can simply return the inpcb. If this fails, typically because something is concurrently modifying the inpcb, we go to the slow path, which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential reference is formally synchronized by net_epoch, not SMR. In particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now, but I didn't do it in this patch.
Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38572
show more ...
|
#
fdb987be |
| 20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a b
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a bug, noted in a comment above in_pcbrehash(): connecting a socket can cause an inpcb to move between hash chains, and this can cause a concurrent lookup to follow the wrong linkage pointers. I believe this could cause rare, spurious ECONNREFUSED errors in the worse case.
Address the problem by introducing a second hash table and adding more linkage pointers to struct inpcb. Now the database has one table each for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at the foreign address of the inpcb to figure out which table to use. This ensures that queue linkage pointers are stable until the socket is disconnected, so the problem described above goes away. There is also a small benefit in that in_pcblookup_*() can now search just one of the two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This parameter seems confusing and it is simpler to let the inpcb code figure out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and disconnected multiple times during their lifecycle. To handle this, the patch plugs a hole in the inpcb structure and uses it to store an SMR sequence number. When an inpcb is disconnected - an operation which requires the global PCB database hash lock - the write sequence number is advanced, and in order to reconnect, the connecting thread must wait for readers to drain before reusing the inpcb's hash chain linkage pointers.
raw_ip (ab)uses the hash table without using the corresponding accessors. Since there are now two hash tables, it arbitrarily uses the "connected" table for all of its PCBs. This will be addressed in some way in the future.
inp interators which specify a hash bucket will only visit connected PCBs. This is not really correct, but nothing in the tree uses that functionality except raw_ip, which as mentioned above places all of its PCBs in the "connected" table and so is unaffected.
Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38569
show more ...
|
Revision tags: release/13.2.0 |
|
#
317fa516 |
| 28-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option
It has no effect, and an exp-run revealed that it is not in use.
PR: 261398 (exp-run) Reviewed by: mjg, glebius Sponsored by: Klara, Inc.
netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option
It has no effect, and an exp-run revealed that it is not in use.
PR: 261398 (exp-run) Reviewed by: mjg, glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38822
show more ...
|
#
3aff4ccd |
| 27-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Remove IP(V6)_BINDMULTI
This option was added in commit 0a100a6f1ee5 but was never completed. In particular, there is no logic to map flowids to different listening sockets, so it accomplis
netinet: Remove IP(V6)_BINDMULTI
This option was added in commit 0a100a6f1ee5 but was never completed. In particular, there is no logic to map flowids to different listening sockets, so it accomplishes basically the same thing as SO_REUSEPORT. Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to balance among listening sockets using a hash of the 4-tuple and some optional NUMA policy.
The option was never documented or completed, and an exp-run revealed nothing using it in the ports tree. Moreover, it complicates the already very complicated in_pcbbind_setup(), and the checking in in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.
PR: 261398 (exp-run) Reviewed by: glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38574
show more ...
|
#
96871af0 |
| 15-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use family specific sockaddr argument for bind functions
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_bind method and from there on go down the call stack wi
inpcb: use family specific sockaddr argument for bind functions
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_bind method and from there on go down the call stack with family specific argument.
Reviewed by: zlei, melifaro, markj Differential Revision: https://reviews.freebsd.org/D38601
show more ...
|
#
09d3671b |
| 03-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: better document INP_ANONPORT flag
The name is pretty self explaining, but it is unclear why we need this flag, as kernel only sets it and never reads.
|
#
9e46ff4d |
| 03-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
netinet: don't return conflicting inpcb in in_pcbconnect_setup()
Last time this inpcb was actually used was in tcp_connect() before c94c54e4df9a.
|
#
a9d22cce |
| 03-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use family specific sockaddr argument for connect functions
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_connect method and from there on go down the call st
inpcb: use family specific sockaddr argument for connect functions
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_connect method and from there on go down the call stack with family specific argument.
Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38356
show more ...
|
#
0aa120d5 |
| 02-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: allow to provide protocol specific pcb size
The protocol specific structure shall start with inpcb.
Differential revision: https://reviews.freebsd.org/D37126
|
Revision tags: release/12.4.0 |
|
#
d93ec8cb |
| 02-Nov-2022 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Allow SO_REUSEPORT_LB to be used in jails
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed process. It is trivial to support this option in VNET jails, but it's also usef
inpcb: Allow SO_REUSEPORT_LB to be used in jails
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed process. It is trivial to support this option in VNET jails, but it's also useful in traditional jails.
This patch enables LB groups in jails with the following semantics: - all PCBs in a group must belong to the same jail, - PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual listening sockets. One pre-existing quirk of the lbgroup implementation is that non-jailed lbgroups are searched before jailed listening sockets; that is preserved with this change.
Discussed with: glebius MFC after: 1 month Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37029
show more ...
|
#
19acc506 |
| 31-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: retire suppresion of randomization of ephemeral ports
The suppresion was added in 5f311da2ccb with no explanation in the commit message of the exact problem that was fixed. In the BSDCan 2006
inpcb: retire suppresion of randomization of ephemeral ports
The suppresion was added in 5f311da2ccb with no explanation in the commit message of the exact problem that was fixed. In the BSDCan 2006 talk [1], slides 12 to 14, we can find that it seems that there was some problem with the TIME_WAIT state not properly being handled on the remote side (also FreeBSD!), and this switching off the suppression had hidden the problem. The rationale of the change was that other stacks may also be buggy wrt the TIME_WAIT.
I did not find the actual problem in TIME_WAIT that the suppression has hidden, neither a commit that would fix it. However, since that time we started to handle SYNs with RFC5961 instead of RFC793, see 3220a2121cc. We also now have the tcp-testsuite [2], that has full coverage of all possible scenarios of receiving SYN in TIME_WAIT.
This effectively reverts 5f311da2ccb6c216b79049172be840af4778129a and 6ee79c59d2c081f837a11cc78908be54a6dbe09d.
[1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf [2] https://github.com/freebsd-net/tcp-testsuite
Reviewed by: rscheff Discussed with: rscheff, rrs, tuexen Differential revision: https://reviews.freebsd.org/D37042
show more ...
|
#
24cf7a8d |
| 20-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: provide pcbinfo pointer argument to inp_apply_all()
Allows to clear inpcb layer of TCP knowledge.
|