#
9ba11796 |
| 27-Jan-2022 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route change, lagg failover, etc), it must free the m
Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route change, lagg failover, etc), it must free the mbuf. This is because ip_output_send() was written as a wrapper/replacement for a direct call to if_output(), and the contract with if_output() has historically been that it owns the mbufs once called. When ip_output_send() failed to free mbufs, it violated this assumption and lead to leaked mbufs.
This was noticed when using NIC TLS in combination with hardware rate-limited connections. When seeing lots of NIC output drops triggered ratelimit send tag changes, we noticed we were leaking ktls_sessions, send tags and mbufs. This was due ip_output_send() leaking mbufs which held references to ktls_sessions, which in turn held references to send tags.
Many thanks to jbh, rrs, hselasky and markj for their help in debugging this.
Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34054 Reviewed by: hselasky, jhb, rrs MFC after: 2 weeks
show more ...
|
Revision tags: release/12.3.0 |
|
#
44775b16 |
| 24-Nov-2021 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're permitted.
Reviewed by: glebius, jhb MFC after: 1 week Sponsored by: The Fr
netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're permitted.
Reviewed by: glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33097
show more ...
|
#
756bb50b |
| 16-Nov-2021 |
Mark Johnston <markj@FreeBSD.org> |
sctp: Remove now-unneeded mb_unmapped_to_ext() calls
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().
No functional change intended.
Reviewed by: tuexen MFC after: 2 weeks
sctp: Remove now-unneeded mb_unmapped_to_ext() calls
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().
No functional change intended.
Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32942
show more ...
|
#
2144431c |
| 08-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to pr
Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D32434
show more ...
|
#
62e1a437 |
| 23-Aug-2021 |
Zhenlei Huang <zlei.huang@gmail.com> |
routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways for IPv4 routes. T
routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways for IPv4 routes. This behavior is controlled by the net.route.rib_route_ipv6_nexthop sysctl (on by default).
* Always pass final destination in ro->ro_dst in ip_forward().
* Use ro->ro_dst to exract packet family inside if_output() routines. Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.
* Pass extracted family to nd6_resolve() to get the LLE with proper encap. It leverages recent lltable changes committed in c541bd368f86.
Presence of the functionality can be checked using ipv4_rfc5549_support feature(3). Example usage: route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0
Differential Revision: https://reviews.freebsd.org/D30398 MFC after: 2 weeks
show more ...
|
#
9748eb74 |
| 07-Aug-2021 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Simplify nhop operations in ip_output().
Consistently use `nh` instead of always dereferencing ro->ro_nh inside the if block. Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Simplify nhop operations in ip_output().
Consistently use `nh` instead of always dereferencing ro->ro_nh inside the if block. Always use nexthop mtu, as it provides guarantee that mtu is accurate. Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses of updating ro flags based on different nexthop.
Differential Revision: https://reviews.freebsd.org/D31451 Reviewed by: kp MFC after: 2 weeks
show more ...
|
#
65634ae7 |
| 23-Apr-2021 |
Wojciech Macek <wma@FreeBSD.org> |
mroute: fix race condition during mrouter shutting down
There is a race condition between V_ip_mrouter de-init and ip_mforward handling. It might happen that mrouted is cleaned up after
mroute: fix race condition during mrouter shutting down
There is a race condition between V_ip_mrouter de-init and ip_mforward handling. It might happen that mrouted is cleaned up after V_ip_mrouter check and before processing packet in ip_mforward. Use epoch call aproach, similar to IPSec which also handles such case.
Reported by: Damien Deville Obtained from: Stormshield Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29946
show more ...
|
Revision tags: release/13.0.0 |
|
#
3f43ada9 |
| 28-Jan-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was e
Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.
show more ...
|
Revision tags: release/12.2.0 |
|
#
868aabb4 |
| 09-Oct-2020 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to
Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a priority will insert a special 801.1Q vlan header with vlan ID = 0 to carry the priority setting
Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26409
show more ...
|
#
fedeb08b |
| 03-Oct-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups. Each group contains the collection of nexthops
Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either nexthop or nexthop group. They are distinguished by the presense of NHF_MULTIPATH flag. All dataplane lookup functions returns pointer to the nexthop object, leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes.
Using functionality: * Recompile kernel with ROUTE_MPATH * set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10 route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6: GrpIdx NhIdx Weight Slots Gateway Netif Refcnt 1 ------- ------- ------- --------------------------------------- --------- 1 13 10 1 2001:db8::2 vlan2 14 20 2 2001:db8::3 vlan2
Next steps: * Land outbound hashing for locally-originated routes ( D26523 ). * Fix net/bird multipath (net/frr seems to work fine) * Add ROUTE_MPATH to GENERIC * Set net.route.multipath=1 by default
Tested by: olivier Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26449
show more ...
|
#
2259a030 |
| 21-Sep-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different
Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different times. * Start filling route weight in route change notifications * Pass flowid to UDP/raw IP route lookups * Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact that rtentry can contain multiple nexthops.
Differential Revision: https://reviews.freebsd.org/D26497
show more ...
|
#
374ce248 |
| 18-Sep-2020 |
Mitchell Horne <mhorne@FreeBSD.org> |
Initialize some local variables earlier
Move the initialization of these variables to the beginning of their respective functions.
On our end this creates a small amount of unneeded churn, as these
Initialize some local variables earlier
Move the initialization of these variables to the beginning of their respective functions.
On our end this creates a small amount of unneeded churn, as these variables are properly initialized before their first use in all cases. However, changing this benefits at least one downstream consumer (NetApp) by allowing local and future modifications to these functions to be made without worrying about where the initialization occurs.
Reviewed by: melifaro, rscheff Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26454
show more ...
|
#
b092fd6c |
| 18-Sep-2020 |
Navdeep Parhar <np@FreeBSD.org> |
if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx and rx), TSO, and RSS if the NIC is capable
if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx and rx), TSO, and RSS if the NIC is capable of performing these operations on inner VXLAN traffic.
A VXLAN interface inherits the capabilities of its vxlandev interface if one is specified or of the interface that hosts the vxlanlocal address. If other interfaces will carry traffic for that VXLAN then they must have the same hardware capabilities.
On transmit, if_vxlan verifies that the outbound interface has the required capabilities and then translates the CSUM_ flags to their inner equivalents. This tells the hardware ifnet that it needs to operate on the inner frame and not the outer VXLAN headers.
An event is generated when a VXLAN ifnet starts. This allows hardware drivers to configure their devices to expect VXLAN traffic on the specified incoming port.
On receive, the hardware does RSS and checksum verification on the inner frame. if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is not very clear why it didn't do this already.
Future work: Rx: it should be possible to avoid the first trip up the protocol stack to get the frame to if_vxlan just so it can decapsulate and requeue for a second trip up the stack. The hardware NIC driver could directly call an if_vxlan receive routine for VXLAN traffic instead.
Rx: LRO. depends on what happens with the previous item. There will have to to be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
Reviewed by: kib@ Relnotes: Yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873
show more ...
|
#
662c1305 |
| 01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
net: clean up empty lines in .c and .h files
|
#
95033af9 |
| 18-Jun-2020 |
Mark Johnston <markj@FreeBSD.org> |
Add the SCTP_SUPPORT kernel option.
This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to suppor
Add the SCTP_SUPPORT kernel option.
This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to support a loadable SCTP implementation.
Discussed with: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
show more ...
|
Revision tags: release/11.4.0 |
|
#
3553b300 |
| 28-May-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Switch ip_output/icmp_reflect rt lookup calls with fib4_lookup.
fib4_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack da
Switch ip_output/icmp_reflect rt lookup calls with fib4_lookup.
fib4_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying.
Conversion is straight-forwarded, as the only 2 differences are requirement of running in network epoch and the need to handle RTF_GATEWAY case in the caller code.
Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24976
show more ...
|
#
174fb9db |
| 17-May-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Remove redundant checks for nhop validity. Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're checking the same thing twice.
In the near future the implementation of this check wil
Remove redundant checks for nhop validity. Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're checking the same thing twice.
In the near future the implementation of this check will be simpler, as there are plans to introduce control-plane interface status monitoring similar to ipfw interface tracker.
show more ...
|
#
6043ac20 |
| 11-May-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism to disable hardware packet pacing for TCP retransmits. This mechanism works by skipping the send-t
Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism to disable hardware packet pacing for TCP retransmits. This mechanism works by skipping the send-tag stamp on rate-limited connections when the TCP stack calls ip_output() with the IP_NO_SND_TAG_RL flag set.
When doing NIC TLS, we must ignore this flag, as NIC TLS packets must always be stamped. Failure to stamp a NIC TLS packet will result in crypto issues.
Reviewed by: hselasky, rrs Sponsored by: Netflix, Mellanox
show more ...
|
#
7b6c99d0 |
| 03-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed:
Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed:
s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g
Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
show more ...
|
#
4043ee3c |
| 28-Apr-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Convert rtalloc_mpath_fib() users to the new KPI.
New fib[46]_lookup() functions support multipath transparently. Given that, switch the last rtalloc_mpath_fib() calls to dib4_lookup() and eliminat
Convert rtalloc_mpath_fib() users to the new KPI.
New fib[46]_lookup() functions support multipath transparently. Given that, switch the last rtalloc_mpath_fib() calls to dib4_lookup() and eliminate the function itself.
Note: proper flowid generation (especially for the outbound traffic) is a bigger topic and will be handled in a separate review. This change leaves flowid generation intact.
Differential Revision: https://reviews.freebsd.org/D24595
show more ...
|
#
983066f0 |
| 25-Apr-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information to perfor
Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with outher parts of the stack and allows to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nextop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
show more ...
|
#
23feb563 |
| 14-Apr-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to
KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs.
This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache.
Many thanks to glebius for helping to make this better in the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213
show more ...
|
#
051669e8 |
| 25-Jan-2020 |
Dimitry Andric <dim@FreeBSD.org> |
Merge ^/head r356931 through r357118.
|
#
b9555453 |
| 22-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make ip6_output() and ip_output() require network epoch.
All callers that before may called into these functions without network epoch now must enter it.
|
#
8d5c56da |
| 01-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCESS if a firewall forbids sending.
Noticed by: ae
|