#
0aa120d5 |
| 02-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: allow to provide protocol specific pcb size
The protocol specific structure shall start with inpcb.
Differential revision: https://reviews.freebsd.org/D37126
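The layout rule can be illustrated with a minimal sketch. The names `inpcb_sketch`, `myproto_pcb`, `pcb_alloc()` and `inp_to_myproto()` are hypothetical stand-ins, not the kernel's definitions. The sketch only shows why the generic structure has to come first: with the generic part at offset 0, a pointer to the protocol structure and a pointer to its embedded generic part are interchangeable, so a generic allocator can be told the larger size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic PCB; not the kernel's struct inpcb. */
struct inpcb_sketch {
	unsigned short	lport;
	unsigned short	fport;
};

/* Hypothetical protocol PCB: the generic part must be the first member. */
struct myproto_pcb {
	struct inpcb_sketch	inp;		/* must stay first */
	int			proto_state;	/* protocol private data */
};

/* Allocate the protocol-sized object but hand out the generic view. */
static struct inpcb_sketch *
pcb_alloc(size_t size)
{
	assert(size >= sizeof(struct inpcb_sketch));
	return (calloc(1, size));
}

/* Recover the protocol view; valid only because inp sits at offset 0. */
static struct myproto_pcb *
inp_to_myproto(struct inpcb_sketch *inp)
{
	return ((struct myproto_pcb *)inp);
}
```

A caller would pass `sizeof(struct myproto_pcb)` to `pcb_alloc()` and convert back with `inp_to_myproto()` wherever protocol state is needed.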
|
Revision tags: release/12.4.0 |
|
#
d93ec8cb |
| 02-Nov-2022 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Allow SO_REUSEPORT_LB to be used in jails
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed process. It is trivial to support this option in VNET jails, but it's also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups.
This is a straightforward extension of the semantics used for individual listening sockets. One pre-existing quirk of the lbgroup implementation is that non-jailed lbgroups are searched before jailed listening sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
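For readers unfamiliar with the option, here is a minimal userland sketch of a listener that joins a load-balancing group. The helper name `lb_listener()`, the loopback address and the backlog of 128 are illustrative assumptions. `SO_REUSEPORT_LB` is FreeBSD-specific, so the sketch is guarded and simply fails with `ENOPROTOOPT` on systems that do not define it.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a TCP listener on 127.0.0.1:port that joins a load-balancing group.
 * Returns the socket, or -1 with errno set.  Hypothetical helper.
 */
static int
lb_listener(unsigned short port)
{
#ifdef SO_REUSEPORT_LB
	struct sockaddr_in sin;
	int fd, on = 1, saved;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	/* Set the option before bind(2) so the socket joins the group. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT_LB, &on, sizeof(on)) == -1 ||
	    bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(fd, 128) == -1) {
		saved = errno;
		close(fd);
		errno = saved;
		return (-1);
	}
	return (fd);
#else
	/* The option does not exist on this system. */
	(void)port;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

Several processes in the same jail could each call `lb_listener()` with the same port; per the semantics above, they would then form one group, and incoming connections would be distributed among its members.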
|
#
19acc506 |
| 31-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: retire suppression of randomization of ephemeral ports
The suppression was added in 5f311da2ccb with no explanation in the commit message of the exact problem that was fixed. In the BSDCan 2006 talk [1], slides 12 to 14, we can find that it seems that there was some problem with the TIME_WAIT state not properly being handled on the remote side (also FreeBSD!), and this switching off the suppression had hidden the problem. The rationale of the change was that other stacks may also be buggy wrt the TIME_WAIT.
I did not find the actual problem in TIME_WAIT that the suppression has hidden, neither a commit that would fix it. However, since that time we started to handle SYNs with RFC5961 instead of RFC793, see 3220a2121cc. We also now have the tcp-testsuite [2], that has full coverage of all possible scenarios of receiving SYN in TIME_WAIT.
This effectively reverts 5f311da2ccb6c216b79049172be840af4778129a and 6ee79c59d2c081f837a11cc78908be54a6dbe09d.
[1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf
[2] https://github.com/freebsd-net/tcp-testsuite
Reviewed by: rscheff
Discussed with: rscheff, rrs, tuexen
Differential revision: https://reviews.freebsd.org/D37042
|
#
24cf7a8d |
| 20-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: provide pcbinfo pointer argument to inp_apply_all()
This allows the inpcb layer to be cleared of TCP knowledge.
|
#
53af6903 |
| 07-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove INP_TIMEWAIT flag
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of these checks and turn them into assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
|
#
893f36b7 |
| 04-Sep-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
netinet: Correct a typo in source code comment
- s/occured/occurred/
MFC after: 3 days
|
Revision tags: release/13.1.0 |
|
#
a35bdd44 |
| 09-Feb-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: add sysctl interface for setting socket options
This interface allows to set a socket option on a TCP endpoint, which is specified by its inp_gencnt. This interface will be used in an upcoming command line tool tcpsso.
Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34138
|
#
afad340a |
| 03-Jan-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: garbage collect INP_LOCK_INIT(), used only once in sctp
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D33543
|
#
fec8a8c7 |
| 03-Jan-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use global UMA zones for protocols
Provide structure inpcbstorage, that holds zones and lock names for a protocol. Initialize it with global protocol init using macro INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses its private hash, but they all use same zone to allocate and SMR section to synchronize.
Note: there is the kern.ipc.maxsockets sysctl, which controls the UMA limit on the socket zone, which was always global. Historically the same maxsockets value is also applied to every PCB zone. Important fact: you can't create a pcb without a socket! A pcb may outlive its socket, however. Given that there are multiple protocols, and only one socket zone, the per pcb zone limits seem to have little value. Under very special conditions a pcb zone limit may trigger a little bit earlier than the socket zone limit, but in most setups the socket zone limit will be triggered earlier. When VIMAGE was added to the kernel, PCB zones became per-VNET. This magnified the existing imbalance further: now we have multiple pcb zones in multiple vnets limited to maxsockets, but every pcb requires a socket allocated from the global zone also limited by maxsockets. IMHO, this per pcb zone limit doesn't bring any value, so this patch drops it. If anybody explains the value of this limit, it can be restored very easily - just a 2 line change to in_pcbstorage_init().
Differential revision: https://reviews.freebsd.org/D33542
|
#
a0577692 |
| 26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: use jenkins hash over the entire IPv6 (or IPv4) address
The intent is to provide more entropy than can be provided by just the 32-bits of the IPv6 address which overlaps with 6to4 tunnels. This is needed to mitigate potential algorithmic complexity attacks from attackers who can control large numbers of IPv6 addresses.
Together with: gallatin
Reviewed by: dwmalone, rscheff
Differential revision: https://reviews.freebsd.org/D33254
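The difference between keying the hash on one 32-bit word and on the whole address can be sketched as follows. This is not the kernel code: it uses Bob Jenkins' one-at-a-time hash as a simple stand-in for the kernel's Jenkins hash routine, and the function names, the seed parameter, and the choice of the last address word for the "old" key are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bob Jenkins' one-at-a-time hash with a seed; a stand-in, not the kernel's routine. */
static uint32_t
jenkins_oaat(const uint8_t *key, size_t len, uint32_t seed)
{
	uint32_t h = seed;

	for (size_t i = 0; i < len; i++) {
		h += key[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return (h);
}

/* Old style (assumed): a single 32-bit word of the address is the key. */
static uint32_t
hash_one_word(const uint8_t addr[16])
{
	uint32_t w;

	memcpy(&w, addr + 12, sizeof(w));
	return (w);
}

/* New style: all 16 bytes of the address are mixed into the hash. */
static uint32_t
hash_full_addr(const uint8_t addr[16], uint32_t seed)
{
	return (jenkins_oaat(addr, 16, seed));
}
```

With the one-word key, an attacker who controls many addresses only has to keep that one word constant to put every connection into the same bucket. With the full-address hash, any two addresses that differ in a single byte are guaranteed to produce different 32-bit hash values, because every mixing step is invertible.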
|
#
a370832b |
| 26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove delayed drop KPI
No longer needed after tcp_output() can ask caller to drop.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33371
|
#
db0ac6de |
| 02-Dec-2021 |
Cy Schubert <cy@FreeBSD.org> |
Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"
This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.
A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
|
#
266f97b5 |
| 02-Dec-2021 |
Cy Schubert <cy@FreeBSD.org> |
wpa: Import wpa_supplicant/hostapd commit 14ab4a816
This is the November update to vendor/wpa committed upstream 2021-11-26.
MFC after: 1 month
|
#
2e27230f |
| 02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: rewrite inpcb synchronization
Just trust the pcb database, that if we did in_pcbref(), no way an inpcb can go away. And if we never put a dropped inpcb on our queue, and tcp_discardcb() always removes an inpcb to be dropped from the queue, then any inpcb on the queue is valid.
Now, to solve the LOR between the inpcb lock and the HPTS queue lock, do the following trick. When we are about to process a certain time slot, move the full queue of the head list onto an on-stack list, drop the HPTS lock and work on our queue. This of course opens a race when an inpcb is being removed from the on-stack queue, which was already mentioned in comments. To address this race, introduce a generation count into queues. If we want to remove an inpcb with a generation count mismatch, we can't do that; we can only mark it with the desired new time slot or -1 for removal.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33026
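The generation count trick can be sketched in a simplified, single-threaded form. All names here (`struct ent`, `struct slotq`, `slotq_*`, `SLOT_REMOVE`) are hypothetical and locking is omitted entirely; the real code is organized differently. The sketch only shows how a mismatch between an entry's generation and the queue's generation reveals that the entry sits on a worker's private list and must be marked rather than unlinked.

```c
#include <stdbool.h>
#include <stddef.h>

#define	SLOT_NONE	(-2)	/* hypothetical: no deferred request */
#define	SLOT_REMOVE	(-1)	/* deferred request: remove the entry */

struct ent {
	struct ent	*next;
	unsigned	 gencnt;	/* queue generation at insert time */
	int		 request;	/* new slot, SLOT_REMOVE or SLOT_NONE */
};

struct slotq {
	struct ent	*head;
	unsigned	 gencnt;	/* bumped whenever the list is detached */
};

static void
slotq_insert(struct slotq *q, struct ent *e)
{
	e->gencnt = q->gencnt;
	e->request = SLOT_NONE;
	e->next = q->head;
	q->head = e;
}

/* Worker side: take the whole list; the queue lock could be dropped after this. */
static struct ent *
slotq_detach(struct slotq *q)
{
	struct ent *list = q->head;

	q->head = NULL;
	q->gencnt++;
	return (list);
}

/* Returns true if unlinked now, false if only a request was recorded. */
static bool
slotq_remove(struct slotq *q, struct ent *e)
{
	struct ent **pp;

	if (e->gencnt != q->gencnt) {
		/* Entry is on a worker's private list: mark it, leave the links alone. */
		e->request = SLOT_REMOVE;
		return (false);
	}
	for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			e->next = NULL;
			return (true);
		}
	}
	return (false);
}
```

In this sketch the worker that owns the detached list would look at `request` when it reaches each entry and either drop the entry or reinsert it at the requested slot.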
|
#
f971e791 |
| 02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: rename input queue to drop queue and trim dead code
The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context.
In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33025
|
#
de2d4784 |
| 02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector.
A fairly new feature of uma(9), Safe Memory Reclamation, allows memory to be freed safely in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts a stricter requirement on access to the protected memory: a critical(9) section is needed to access it. Details:
- The database is already built on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database an SMR section is now required. Once the desired inpcb is found we need to transition from the SMR section to the r/w lock on the inpcb itself, with a check that the inpcb isn't yet freed. This requires some complexity, since the SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock().
- For an inpcb list traversal (a pcblist sysctl, or broadcast notification) a new KPI is also provided that hides the internals of the database: inp_next(struct inp_iterator *).
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022
|
#
565655f4 |
| 02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: reduce some aliased functions after removal of PCBGROUP.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33021
|
#
93c67567 |
| 02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove "options PCBGROUP"
With upcoming changes to the inpcb synchronisation it is going to be broken. Even its current status after the move of PCB synchronization to the network epoch is very questionable.
This experimental feature was sponsored by Juniper but ended up never being used at Juniper and doesn't exist in their source tree [sjg@, stevek@, jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
I'm open to resurrecting it if there is any interest from anybody.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33020
|
Revision tags: release/12.3.0 |
|
#
0f617ae4 |
| 18-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c.
|
#
744a64bd |
| 28-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: garbage collect in_pcbrele()
|
#
5a78df20 |
| 27-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: garbage collect unused structure in_pcblist
|
#
d7955cc0 |
| 06-Jul-2021 |
Randall Stewart <rrs@FreeBSD.org> |
tcp: HPTS performance enhancements
HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by no longer relying on the timer/callout to drive hpts; instead, the user return from a system call as well as LRO flushes drive it. The timer becomes a backstop that dynamically adjusts based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
|
#
1db08fbe |
| 16-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c920275. Now we carry the knowledge of lock type all the way through tcp_input() and also into tcp_twcheck(). Ideally the rlocking for pure SYNs should propagate all the way into the alternative TCP stacks, but not yet today.
This should close a race when socket is bind(2)-ed but not yet listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered recently by pho@.
|
Revision tags: release/13.0.0 |
|
#
08d9c920 |
| 19-Mar-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB.
Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576
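The "rlock if listening" decision described above can be sketched as a small pure function. The enum, the function name and the `listening` parameter are hypothetical; the kernel expresses this through its PCB lookup flags rather than a helper like this one. Only the SYN and ACK bit values are taken from the TCP header definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits as defined by the protocol. */
#define	FLAG_SYN	0x02
#define	FLAG_ACK	0x10

/* Hypothetical lock modes; the kernel's lookup flags are named differently. */
enum pcb_lock { PCB_RLOCK, PCB_WLOCK };

/*
 * A segment with SYN set and ACK clear that lands on a listening PCB does
 * not modify that PCB, so a read lock is enough.  Everything else keeps
 * the write lock.
 */
static enum pcb_lock
lock_for_segment(uint8_t thflags, bool listening)
{
	if ((thflags & (FLAG_SYN | FLAG_ACK)) == FLAG_SYN && listening)
		return (PCB_RLOCK);
	return (PCB_WLOCK);
}
```

Because many SYNs can then hold the listening PCB's read lock at once, the listening socket stops being a serialization point under SYN flood, which is the contention the commit message refers to.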
|
#
69a34e8d |
| 27-Jan-2021 |
Randall Stewart <rrs@FreeBSD.org> |
Update the LRO processing code so that we can support further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these, since if the transport does not support mbuf compression, then the old code paths stay in place. In this commit we also take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28374
|