#
5f539170 |
| 07-Mar-2025 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: retire two-level port hash database
This structure originates from the pre-FreeBSD times when system RAM was measured in single digits of MB and Internet speeds were measured in Kb. At first
inpcb: retire two-level port hash database
This structure originates from the pre-FreeBSD times when system RAM was measured in single digits of MB and Internet speeds were measured in Kb. At first level the database hashes the port value only to calculate index into array of pointers to lazily allocated headers that hold lists of inpcbs with the same local port. This design apparently was made to preserve kernel memory.
In the modern kernel size of the first level of the hash is derived from maxsockets, which is derived from maxfiles, which in its turn is derived from amount of physical memory. Then the size of the hash is capped by IPPORT_MAX, cause it doesn't make any sense to have hash table larger then the set of possible values. In practice this cap works even on my laptop. I haven't done precise calculation or experiments, but my guess is that any system with > 8 Gb of RAM will be autotuned to IPPORT_MAX sized hash. Apparently, this hash is a degenerate one: it never has more than one entries in any slot. You can check this with kgdb:
set $i = 0 while ($i <= tcbinfo->ipi_porthashmask) set $p = tcbinfo->ipi_porthashbase[$i].clh_first set $c = 0 while ($p != 0) set $c = $c + 1 set $p = $p->phd_hash.cle_next end if ($c > 1) printf "Slot %u count %u", $i, $c end set $i = $i + 1 end
Retiring the two level hash we remove a lot of complexity at the cost of only one comparison 'inp->inp_lport != lport' in the lookup cycle, which is going to be always false on most machines anyway. This comparison definitely shall be cheaper than extra pointer traversal.
Another positive change to be singled out is that now we no longer need to allocate memory in non-sleepable context in in_pcbinshash(), so a potential ENOMEM on connect(2) is removed.
Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D49151
show more ...
|
Revision tags: release/13.5.0, release/14.2.0-p2, release/14.1.0-p8, release/13.4.0-p4 |
|
#
da806e8d |
| 06-Feb-2025 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Add FIB-aware inpcb lookup
Allow protocol layers to look up an inpcb belonging to a particular FIB. This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be used is obtained fr
inpcb: Add FIB-aware inpcb lookup
Allow protocol layers to look up an inpcb belonging to a particular FIB. This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be used is obtained from the specificed mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48662
show more ...
|
#
bbd0084b |
| 06-Feb-2025 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Add a flags parameter to in_pcbbind()
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB number. When this flag is specified, duplicate bindings are permitted, so long a
inpcb: Add a flags parameter to in_pcbbind()
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB number. When this flag is specified, duplicate bindings are permitted, so long as each FIB contains at most one inpcb bound to the same address/port. If an inpcb is bound with this flag, it'll have the INP_BOUNDFIB flag set.
No functional change intended.
Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48661
show more ...
|
#
9a413162 |
| 06-Feb-2025 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Imbue in(6)_pcblookup_local() with a FIB parameter
This is to enable a mode where duplicate inpcb bindings are permitted, and we want to look up an inpcb with a particular FIB. Thus, add a "
inpcb: Imbue in(6)_pcblookup_local() with a FIB parameter
This is to enable a mode where duplicate inpcb bindings are permitted, and we want to look up an inpcb with a particular FIB. Thus, add a "fib" parameter to in_pcblookup() and related functions, and plumb it through.
A fib value of RT_ALL_FIBS indicates that the lookup should ignore FIB numbers when searching. Otherwise, it should refer to a valid FIB number, and the returned inpcb should belong to the specific FIB. For now, just add the fib parameter where needed, as there are several layers to plumb through.
No functional change intended.
Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48660
show more ...
|
Revision tags: release/14.1.0-p7, release/14.2.0-p1, release/13.4.0-p3 |
|
#
c9756953 |
| 23-Dec-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Further restrict binding to a port owned by a different UID
See commit 4f02a7d739b3 for more background.
I cannot see a good reason to continue ignoring mismatching UIDs when binding to INAD
inpcb: Further restrict binding to a port owned by a different UID
See commit 4f02a7d739b3 for more background.
I cannot see a good reason to continue ignoring mismatching UIDs when binding to INADDR_ANY. Looking at the sdr.V2.4a7n sources (mentioned in bugzilla PR 7713), there is a CANT_MCAST_BIND hack wherein the application binds to INADDR_ANY instead of a multicast address, but CANT_MCAST_BIND isn't defined for FreeBSD builds.
It seems unlikely that we still have a use-case for allowing sockets from different UIDs to bind to the same port when binding to the unspecified address. And, as noted in D47832, applications like sdr would have been broken by the inverted SO_REUSEPORT check removed in that revision, apparently without any bug reports. Let's break compatibility and simply disallow this case outright.
Also, add some comments, remove a hack in a regression test which tests this funtionality, and add a new regression test to exercise the remaining checks that were added in commit 4658dc8325e03.
MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D47870
show more ...
|
#
4f02a7d7 |
| 12-Dec-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Remove bogus SO_REUSEPORT(_LB) checks in in_pcbbind()
This check for SO_REUSEPORT was added way back in commit 52b65dbe85faf. Per the commit log, this commit restricted this port-stealing che
inpcb: Remove bogus SO_REUSEPORT(_LB) checks in in_pcbbind()
This check for SO_REUSEPORT was added way back in commit 52b65dbe85faf. Per the commit log, this commit restricted this port-stealing check to unicast addresses, and then only if the existing socket does not have SO_REUSEPORT set. In other words, if there exists a socket bound to INADDR_ANY, and we bind a socket to INADDR_ANY with the same port, then the two sockets need not be owned by the same user if the existing socket has SO_REUSEPORT set.
This is a surprising semantic; bugzilla PR 7713 gives some additional context. That PR makes a case for the behaviour described above when binding to a multicast address. But, the SO_REUSEPORT check is only applied when binding to a non-multicast address, so it doesn't really make sense. In the PR the committer notes that "unicast applications don't set SO_REUSEPORT", which makes some sense, but also refers to "multicast applications that bind to INADDR_ANY", which sounds a bit suspicious.
OpenBSD performs the multicast check, but not the SO_REUSEPORT check. DragonflyBSD removed the SO_REUSEPORT (and INADDR_ANY) checks back in 2014 (commit 0323d5fde12a4). NetBSD explicitly copied our logic and still has it.
The plot thickens: 20 years later, SO_REUSEPORT_LB was ported from DragonflyBSD: this option provides similar semantics to SO_REUSEPORT, but for unicast addresses it causes incoming connections/datagrams to be distributed among all sockets in the group. This commit (1a43cff92a20d) inverted the check for SO_REUSEPORT while adding one for SO_REUSEPORT_LB; this appears to have been inadvertent. However: - apparently no one has noticed that the semantics were changed; - sockets belonging to different users can now be bound to the same port so long as they belong to a single lbgroup bound to INADDR_ANY, which is not correct.
Simply remove the SO_REUSEPORT(_LB) checks, as their original justification was dubious and their current implementation is wrong; add some tests.
Reviewed by: glebius MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D47832
show more ...
|
#
a600aabe |
| 12-Dec-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Close some SO_REUSEPORT_LB races
For a long time, the inpcb lookup path has been lockless in the common case: we use net_epoch to synchronize lookups. However, the routines which update lbgr
inpcb: Close some SO_REUSEPORT_LB races
For a long time, the inpcb lookup path has been lockless in the common case: we use net_epoch to synchronize lookups. However, the routines which update lbgroups were not careful to synchronize with unlocked lookups. I believe that in the worst case this can result in spurious connection aborts (I have a regression test case to exercise this), but it's hard to be certain.
Modify in_pcblbgroup* routines to synchronize with unlocked lookup: - When removing inpcbs from an lbgroup, do not shrink the array. The maximum number of lbgroup entries is INPCBLBGROUP_SIZMAX (256), and it doesn't seem worth the complexity to shrink the array when a socket is removed. - When resizing an lbgroup, do not insert it into the hash table until it is fully initialized; otherwise lookups may observe a partially constructed lbgroup. - When adding an inpcb to the group, increment the counter after adding the array entry, using a release store. Otherwise it's possible for lookups to observe a null array slot. - When looking up an entry, use a corresponding acquire load.
Reviewed by: ae, glebius MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D48020
show more ...
|
#
ffb3d384 |
| 05-Dec-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Fix the GENERIC-NODEBUG build
Fixes: 01f8ce83242d ("inpcb: Factor out parts of in6_pcbbind() and in_pcbbind_setup()")
|
#
01f8ce83 |
| 05-Dec-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Factor out parts of in6_pcbbind() and in_pcbbind_setup()
A large portion of these functions just determines whether the inpcb can bind to the address/port. This portion has no side effects,
inpcb: Factor out parts of in6_pcbbind() and in_pcbbind_setup()
A large portion of these functions just determines whether the inpcb can bind to the address/port. This portion has no side effects, so is a good candidate to move into its own helper function. This patch does so, making the callers less complicated and reducing indentation.
While moving this code, also make some changes: - Load socket options (SO_REUSEADDR etc.) only once. There is nothing preventing another thread from toggling the socket options, so make this function easier to reason about by avoiding races. - When checking whether the bind address is an interface address, make a separate sockaddr rather than temporarily modifying the one passed to in_pcbbind().
Reviewed by: ae, glebius MFC after: 1 month Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D47590
show more ...
|
Revision tags: release/14.2.0 |
|
#
52ef944b |
| 14-Nov-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Constify address parameters to in6 pcb lookup routines
No functional change intended.
MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Stormshield
|
#
45a77bf2 |
| 14-Nov-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Make some cosmetic improvements to in_pcbbind()
- Use the local var "laddr" instead of sin->sin_addr in one block. - Use in_nullhost() instead of explicit comparisons with INADDR_ANY. - Combi
inpcb: Make some cosmetic improvements to in_pcbbind()
- Use the local var "laddr" instead of sin->sin_addr in one block. - Use in_nullhost() instead of explicit comparisons with INADDR_ANY. - Combine multiple socket options checks into one. - Fix indentation. - Remove some unhelpful comments.
This is in preparation for some simplification and bug-fixing.
No functional change intended.
Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D47451
show more ...
|
#
21d7ac8c |
| 08-Nov-2024 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Remove some unused parameters in internal hash lookup functions
in_pcblookup_hash_wild_* looks up unconnected inpcbs, so there is no point in passing the foreign address and port, and indeed
inpcb: Remove some unused parameters in internal hash lookup functions
in_pcblookup_hash_wild_* looks up unconnected inpcbs, so there is no point in passing the foreign address and port, and indeed those parameters are not used. So, remove them.
No functional change intended.
MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D47385
show more ...
|
Revision tags: release/13.4.0 |
|
#
0c605af3 |
| 29-Aug-2024 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Explicitly disallow connections to the unspecified address
If the V_connect_ifaddr_wild sysctl says that we shouldn't infer a destination address, return an error. Otherwise it's possible
netinet: Explicitly disallow connections to the unspecified address
If the V_connect_ifaddr_wild sysctl says that we shouldn't infer a destination address, return an error. Otherwise it's possible for use of an unspecified foreign address to trigger a subsequent assertion failure, for example in in_pcblookup_hash_locked().
Similarly, if no interface addresses are assigned, fail quickly upon an attempt to connect to the unspecified address.
Reported by: Shawn Webb <shawn.webb@hardenedbsd.org> MFC after: 2 weeks Reviewed by: zlei, allanjude, emaste Differential Revision: https://reviews.freebsd.org/D46454
show more ...
|
#
417b35a9 |
| 20-Aug-2024 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Add a sysctl to allow disabling connections to INADDR_ANY
See the discussion in Bugzilla PR 280705 for context.
PR: 280705 MFC after: 1 week Differential Revision: https://reviews.freebsd
netinet: Add a sysctl to allow disabling connections to INADDR_ANY
See the discussion in Bugzilla PR 280705 for context.
PR: 280705 MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D46259
show more ...
|
Revision tags: release/14.1.0 |
|
#
042fb58d |
| 12-Apr-2024 |
Lexi Winter <lexi@le-Fay.ORG> |
sys/netinet6/in6_pcb.c: fix compile without INET
in6_mapped_sockaddr() and in6_mapped_peeraddr() both define a local variable named 'inp', but in the non-INET case, this variable is set and never us
sys/netinet6/in6_pcb.c: fix compile without INET
in6_mapped_sockaddr() and in6_mapped_peeraddr() both define a local variable named 'inp', but in the non-INET case, this variable is set and never used, causing a compiler error:
/src/freebsd/src/lf/sys/netinet6/in6_pcb.c:547:16: error: variable 'inp' set but not used [-Werror,-Wunused-but-set-variable] 547 | struct inpcb *inp; | ^ /src/freebsd/src/lf/sys/netinet6/in6_pcb.c:573:16: error: variable 'inp' set but not used [-Werror,-Wunused-but-set-variable] 573 | struct inpcb *inp;
Fix this by guarding all the INET-specific logic, including the variable definition, behind #ifdef INET.
While here, tweak formatting in in6_mapped_peeraddr() so both functions are the same.
Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/1155
show more ...
|
Revision tags: release/13.3.0 |
|
#
0fac350c |
| 30-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: don't malloc/free sockaddr memory on getpeername/getsockname
Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses.
sockets: don't malloc/free sockaddr memory on getpeername/getsockname
Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses. Although, these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size.
Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods.
Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
show more ...
|
#
cfb1e929 |
| 30-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: don't malloc/free sockaddr memory on accept(2)
Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may
sockets: don't malloc/free sockaddr memory on accept(2)
Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may provide smaller structure.
While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting required length in case if provided length was insufficient. Our manual page accept(2) and POSIX don't explicitly require that, but one can read the text as they do. Linux also does that. Update tests accordingly.
Reviewed by: rscheff, tuexen, zlei, dchagin Differential Revision: https://reviews.freebsd.org/D42635
show more ...
|
#
29363fb4 |
| 23-Nov-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove ancient SCCS tags.
Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl s
sys: Remove ancient SCCS tags.
Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script.
Sponsored by: Netflix
show more ...
|
Revision tags: release/14.0.0 |
|
#
0bf5377b |
| 14-Sep-2023 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Avoid IPv6 source address selection on accepting TCP connections
When an application listens IPv6 TCP socket, due to ipfw forwarding tag it may handle connections for addresses that do not belongs t
Avoid IPv6 source address selection on accepting TCP connections
When an application listens IPv6 TCP socket, due to ipfw forwarding tag it may handle connections for addresses that do not belongs to the jail or even current host (transparent proxy). Syncache code can successfully handle TCP handshake for such connections. When syncache finally accepts connection it uses in6_pcbconnect() to properly initlize new connection info.
For IPv4 this scenario just works, but for IPv6 it fails when local address doesn't belongs to the jail. This check occurs when in6_pcbladdr() applies IPv6 SAS algorithm. We need IPv6 SAS when we are connection initiator, but in the above case connection is already established and both source and destination addresses are known.
Use unused argument to notify in6_pcbconnect() when we don't need source address selection. This will fix `ipfw fwd` to jailed IPv6 address.
When we are connection initiator, we stil use IPv6 SAS algorithm and apply all related restrictions.
MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D41685
show more ...
|
#
685dc743 |
| 16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: one-line .c pattern
Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
|
#
e3ba0d6a |
| 27-Jul-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: do not copy so_options into inp_flags2
Since f71cb9f74808 socket stays connnected with inpcb through latter's lifetime and there is no reason to complicate things and copy these flags.
Revie
inpcb: do not copy so_options into inp_flags2
Since f71cb9f74808 socket stays connnected with inpcb through latter's lifetime and there is no reason to complicate things and copy these flags.
Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D41198
show more ...
|
#
a306ed50 |
| 30-May-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Restore missing validation of local addresses for jailed sockets
When looking up a listening socket, the SMR-protected lookup routine may return a jailed socket with no local address. This h
inpcb: Restore missing validation of local addresses for jailed sockets
When looking up a listening socket, the SMR-protected lookup routine may return a jailed socket with no local address. This happens when using classic jails with more than one IP address; in a single-IP classic jail, a bound socket's local address is always rewritten to be that of the jail.
After commit 7b92493ab1d4, the lookup path failed to check whether the jail corresponding to a matched wildcard socket actually owns the address, and would return the match regardless. Restore the omitted checks.
Fixes: 7b92493ab1d4 ("inpcb: Avoid inp_cred dereferences in SMR-protected lookup") Reported by: peter Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D40268
show more ...
|
#
7b92493a |
| 20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed b
inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed bound sockets. To do this it has to maintain a ucred reference, and for this to be safe, the reference can't be released until the UMA destructor is called, and this will not happen within any bounded time period.
Changing SMR to periodically recycle garbage is not trivial. Instead, let's implement SMR-synchronized lookup without needing to dereference inp_cred. This will allow the inpcb code to free the inp_cred reference immediately when a PCB is freed, ensuring that ucred (and thus jail) references are released promptly.
Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup") gets us part of the way there. This patch goes further to handle lookups of unconnected sockets. Here, the strategy is to maintain a well-defined order of items within a hash chain so that a wild lookup can simply return the first match and preserve existing semantics. This makes insertion of listening sockets more complicated in order to make lookup simpler, which seems like the right tradeoff anyway given that bind() is already a fairly expensive operation and lookups are more common.
In particular, when inserting an unconnected socket, in_pcbinhash() now keeps the following ordering: - jailed sockets before non-jailed sockets, - specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash lookups. When a match is found, we try to lock the inpcb and re-validate its connection info. In the common case, this works well and we can simply return the inpcb. If this fails, typically because something is concurrently modifying the inpcb, we go to the slow path, which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential reference is formally synchronized by net_epoch, not SMR. In particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now, but I didn't do it in this patch.
Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38572
show more ...
|
#
3e98dcb3 |
| 20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Move inpcb matching logic into separate functions
These functions will get some additional callers in future revisions.
No functional change intended.
Discussed with: glebius Tested by: gle
inpcb: Move inpcb matching logic into separate functions
These functions will get some additional callers in future revisions.
No functional change intended.
Discussed with: glebius Tested by: glebius Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38571
show more ...
|
#
fdb987be |
| 20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a b
inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a bug, noted in a comment above in_pcbrehash(): connecting a socket can cause an inpcb to move between hash chains, and this can cause a concurrent lookup to follow the wrong linkage pointers. I believe this could cause rare, spurious ECONNREFUSED errors in the worse case.
Address the problem by introducing a second hash table and adding more linkage pointers to struct inpcb. Now the database has one table each for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at the foreign address of the inpcb to figure out which table to use. This ensures that queue linkage pointers are stable until the socket is disconnected, so the problem described above goes away. There is also a small benefit in that in_pcblookup_*() can now search just one of the two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This parameter seems confusing and it is simpler to let the inpcb code figure out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and disconnected multiple times during their lifecycle. To handle this, the patch plugs a hole in the inpcb structure and uses it to store an SMR sequence number. When an inpcb is disconnected - an operation which requires the global PCB database hash lock - the write sequence number is advanced, and in order to reconnect, the connecting thread must wait for readers to drain before reusing the inpcb's hash chain linkage pointers.
raw_ip (ab)uses the hash table without using the corresponding accessors. Since there are now two hash tables, it arbitrarily uses the "connected" table for all of its PCBs. This will be addressed in some way in the future.
inp interators which specify a hash bucket will only visit connected PCBs. This is not really correct, but nothing in the tree uses that functionality except raw_ip, which as mentioned above places all of its PCBs in the "connected" table and so is unaffected.
Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38569
show more ...
|