History log of /linux/tools/testing/selftests/net/Makefile (Results 1 – 25 of 1107)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 0ad9617c 22-Jan-2025 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
"This is slightly smaller than usual, with the most interesting

Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
"This is slightly smaller than usual, with the most interesting work
being still around RTNL scope reduction.

Core:

- More core refactoring to reduce the RTNL lock contention, including
preparatory work for the per-network namespace RTNL lock, replacing
RTNL lock with a per device-one to protect NAPI-related net device
data and moving synchronize_net() calls outside such lock.

- Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge
and more specific TCP coverage.

- Reduce network namespace tear-down time by removing per-subsystems
synchronize_net() in tipc and sched.

- Add flow label selector support for fib rules, allowing traffic
redirection based on such header field.

Netfilter:

- Do not remove netdev basechain when last device is gone, allowing
netdev basechains without devices.

- Revisit the flowtable teardown strategy, dealing better with fin,
reset and re-open events.

- Scale-up IP-vs connection dumping by avoiding linear search on each
restart.

Protocols:

- A significant XDP socket refactor, consolidating and optimizing
several helpers into the core

- Better scaling of ICMP rate-limiting, by removing false-sharing in
inet peers handling.

- Introduces netlink notifications for multicast IPv4 and IPv6
address changes.

- Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
aggregation and fragmentation of the inner IP.

- Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to
avoid local port exhaustion issues when the average connection
lifetime is very short.

- Support updating keys (re-keying) for connections using kernel TLS
(for TLS 1.3 only).

- Support ipv4-mapped ipv6 address clients in smc-r v2.

- Add support for jumbo data packet transmission in RxRPC sockets,
gluing multiple data packets in a single UDP packet.

- Support RxRPC RACK-TLP to manage packet loss and retransmission in
conjunction with the congestion control algorithm.

Driver API:

- Introduce a unified and structured interface for reporting PHY
statistics, exposing consistent data across different H/W via
ethtool.

- Make timestamping selectable, allow the user to select the desired
hwtstamp provider (PHY or MAC) administratively.

- Add support for configuring a header-data-split threshold (HDS)
value via ethtool, to deal with partial or buggy H/W
implementation.

- Consolidate DSA drivers Energy Efficiency Ethernet support.

- Add EEE management to phylink, making use of the phylib
implementation.

- Add phylib support for in-band capabilities negotiation.

- Simplify how phylib-enabled mac drivers expose the supported
interfaces.

Tests and tooling:

- Make the YNL tool package-friendly to make it easier to deploy it
separately from the kernel.

- Increase TCP selftest coverage importing several packetdrill
test-cases.

- Regenerate the ethtool uapi header from the YNL spec, to ease
maintenance and future development.

- Add YNL support for decoding the link types used in net self-tests,
allowing a single build to run both net and drivers/net.

Drivers:

- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- add cross E-Switch QoS support
- add SW Steering support for ConnectX-8
- implement support for HW-Managed Flow Steering, improving the
rule deletion/insertion rate
- support for multi-host LAG
- Intel (ixgbe, ice, igb):
- ice: add support for devlink health events
- ixgbe: add initial support for E610 chipset variant
- igb: add support for AF_XDP zero-copy
- Meta:
- add support for basic RSS config
- allow changing the number of channels
- add hardware monitoring support
- Broadcom (bnxt):
- implement TCP data split and HDS threshold ethtool support,
enabling Device Memory TCP.
- Marvell Octeon:
- implement egress ipsec offload support for the cn10k family
- Hisilicon (HIBMC):
- implement unicast MAC filtering

- Ethernet NICs embedded and virtual:
- Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
contented atomic operations for drop counters
- Freescale:
- quicc: phylink conversion
- enetc: support Tx and Rx checksum offload and improve TSO
performances
- MediaTek:
- airoha: introduce support for ETS and HTB Qdisc offload
- Microchip:
- lan78XX USB: preparation work for phylink conversion
- Synopsys (stmmac):
- support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
- refactor EEE support to leverage the new driver API
- optimize DMA and cache access to increase raw RX performances
by 40%
- TI:
- icssg-prueth: add multicast filtering support for VLAN
interface
- netkit:
- add ability to configure head/tailroom
- VXLAN:
- accepts packets with user-defined reserved bit

- Ethernet switches:
- Microchip:
- lan969x: add RGMII support
- lan969x: improve TX and RX performance using the FDMA engine
- nVidia/Mellanox:
- move Tx header handling to PCI driver, to ease XDP support

- Ethernet PHYs:
- Texas Instruments DP83822:
- add support for GPIO2 clock output
- Realtek:
- 8169: add support for RTL8125D rev.b
- rtl822x: add hwmon support for the temperature sensor
- Microchip:
- add support for RDS PTP hardware
- consolidate periodic output signal generation

- CAN:
- several DT-bindings to DT schema conversions
- tcan4x5x:
- add HW standby support
- support nWKRQ voltage selection
- kvaser:
- allowing Bus Error Reporting runtime configuration

- WiFi:
- the on-going Multi-Link Operation (MLO) effort continues,
affecting both the stack and in drivers
- mac80211/cfg80211:
- Emergency Preparedness Communication Services (EPCS) station
mode support
- support for adding and removing station links for MLO
- add support for WiFi 7/EHT mesh over 320 MHz channels
- report Tx power info for each link
- RealTek (rtw88):
- enable USB Rx aggregation and USB 3 to improve performance
- LED support
- RealTek (rtw89):
- refactor power save to support Multi-Link Operations
- add support for RTL8922AE-VS variant
- MediaTek (mt76):
- single wiphy multiband support (preparation for MLO)
- p2p device support
- add TP-Link TXE50UH USB adapter support
- Qualcomm (ath10k):
- support for the QCA6698AQ IP core
- Qualcomm (ath12k):
- enable MLO for QCN9274

- Bluetooth:
- Allow sysfs to trigger hdev reset, to allow recovering devices
not responsive from user-space
- MediaTek: add support for MT7922, MT7925, MT7921e devices
- Realtek: add support for RTL8851BE devices
- Qualcomm: add support for WCN785x devices
- ISO: allow BIG re-sync"

* tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits)
net/rose: prevent integer overflows in rose_setsockopt()
net: phylink: fix regression when binding a PHY
net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup
net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup
net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path
ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
ipv6: Move lifetime validation to inet6_rtm_newaddr().
ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
ipv6: Pass dev to inet6_addr_add().
ipv6: Convert inet6_ioctl() to per-netns RTNL.
ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
ipv6: Add __in6_dev_get_rtnl_net().
net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
net: mii: Fix the Speed display when the network cable is not connected
sysctl net: Remove macro checks for CONFIG_SYSCTL
eth: bnxt: update header sizing defaults
...

show more ...


# 25768de5 21-Jan-2025 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'next' into for-linus

Prepare input updates for 6.14 merge window.


# 670af65d 20-Jan-2025 Jiri Kosina <jkosina@suse.com>

Merge branch 'for-6.14/constify-bin-attribute' into for-linus

- constification of 'struct bin_attribute' in various HID driver (Thomas Weißschuh)


Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4
# c1bc6d21 20-Dec-2024 Jakub Kicinski <kuba@kernel.org>

Merge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding'

Petr Machata says:

====================
bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING

When bridge binding is enabled on a VLAN

Merge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding'

Petr Machata says:

====================
bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING

When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for a newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.

In this patchset, have bridge react to bridge_binding toggles on VLAN
uppers.

There has been another attempt at supporting this behavior in 2022 by
Sevinj Aghayeva [0]. A discussion ensued that informed how this new
patchset is constructed, namely that the logic is in the bridge as opposed
to the 8021q driver, and the bridge reacts to NETDEV_CHANGE events on the
8021q upper.

Patches #1 and #2 contain the implementation, patches #3 and #4 a
selftest.

[0] https://lore.kernel.org/netdev/cover.1660100506.git.sevinj.aghayeva@gmail.com/
====================

Link: https://patch.msgid.link/cover.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# dca12e9a 18-Dec-2024 Petr Machata <petrm@nvidia.com>

selftests: net: Add a VLAN bridge binding selftest

Add a test that exercises bridge binding.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: N

selftests: net: Add a VLAN bridge binding selftest

Add a test that exercises bridge binding.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/baf7244fd1fe223a6d93e027584fa9f99dee982c.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 6d4a0f4e 17-Dec-2024 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'v6.13-rc3' into next

Sync up with the mainline.


# 9163b05e 17-Dec-2024 Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-support-for-so_priority-cmsg'

Anna Emese Nyiri says:

====================
Add support for SO_PRIORITY cmsg

Introduce a new helper function, `sk_set_prio_allowed`,
to centralize t

Merge branch 'add-support-for-so_priority-cmsg'

Anna Emese Nyiri says:

====================
Add support for SO_PRIORITY cmsg

Introduce a new helper function, `sk_set_prio_allowed`,
to centralize the logic for validating priority settings.
Add support for the `SO_PRIORITY` control message,
enabling user-space applications to set socket priority
via control messages (cmsg).
====================

Link: https://patch.msgid.link/20241213084457.45120-1-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


Revision tags: v6.13-rc3
# cda7d5ab 13-Dec-2024 Anna Emese Nyiri <annaemesenyiri@gmail.com>

selftests: net: test SO_PRIORITY ancillary data with cmsg_sender

Extend cmsg_sender.c with a new option '-Q' to send SO_PRIORITY
ancillary data.

cmsg_so_priority.sh script added to validate SO_PRIO

selftests: net: test SO_PRIORITY ancillary data with cmsg_sender

Extend cmsg_sender.c with a new option '-Q' to send SO_PRIORITY
ancillary data.

cmsg_so_priority.sh script added to validate SO_PRIORITY behavior
by creating VLAN device with egress QoS mapping and testing packet
priorities using flower filters. Verify that packets with different
priorities are correctly matched and counted by filters for multiple
protocols and IP versions.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-4-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# c5fb51b7 03-Jan-2025 Rob Clark <robdclark@chromium.org>

Merge remote-tracking branch 'pm/opp/linux-next' into HEAD

Merge pm/opp tree to get dev_pm_opp_get_bw()

Signed-off-by: Rob Clark <robdclark@chromium.org>


# e7f0a3a6 11-Dec-2024 Rodrigo Vivi <rodrigo.vivi@intel.com>

Merge drm/drm-next into drm-intel-next

Catching up with 6.13-rc2.

Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


Revision tags: v6.13-rc2
# 8f109f28 02-Dec-2024 Rodrigo Vivi <rodrigo.vivi@intel.com>

Merge drm/drm-next into drm-xe-next

A backmerge to get the PMT preparation work for
merging the BMG PMT support.

Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


# 3aba2eba 02-Dec-2024 Maxime Ripard <mripard@kernel.org>

Merge drm/drm-next into drm-misc-next

Kickstart 6.14 cycle.

Signed-off-by: Maxime Ripard <mripard@kernel.org>


# bcfd5f64 02-Dec-2024 Ingo Molnar <mingo@kernel.org>

Merge tag 'v6.13-rc1' into perf/core, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c34e9ab9 05-Dec-2024 Takashi Iwai <tiwai@suse.de>

Merge tag 'asoc-fix-v6.13-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.13

A few small fixes for v6.13, all system specific - the biggest t

Merge tag 'asoc-fix-v6.13-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.13

A few small fixes for v6.13, all system specific - the biggest thing is
the fix for jack handling over suspend on some Intel laptops.

show more ...


Revision tags: v6.13-rc1
# 65ae975e 28-Nov-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth.

Current release - regressions:

-

Merge tag 'net-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth.

Current release - regressions:

- rtnetlink: fix rtnl_dump_ifinfo() error path

- bluetooth: remove the redundant sco_conn_put

Previous releases - regressions:

- netlink: fix false positive warning in extack during dumps

- sched: sch_fq: don't follow the fast path if Tx is behind now

- ipv6: delete temporary address if mngtmpaddr is removed or
unmanaged

- tcp: fix use-after-free of nreq in reqsk_timer_handler().

- bluetooth: fix slab-use-after-free Read in set_powered_sync

- l2tp: fix warning in l2tp_exit_net found

- eth:
- bnxt_en: fix receive ring space parameters when XDP is active
- lan78xx: fix double free issue with interrupt buffer allocation
- tg3: set coherent DMA mask bits to 31 for BCM57766 chipsets

Previous releases - always broken:

- ipmr: fix tables suspicious RCU usage

- iucv: MSG_PEEK causes memory leak in iucv_sock_destruct()

- eth:
- octeontx2-af: fix low network performance
- stmmac: dwmac-socfpga: set RX watchdog interrupt as broken
- rtase: correct the speed for RTL907XD-V1

Misc:

- some documentation fixup"

* tag 'net-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
ipmr: fix build with clang and DEBUG_NET disabled.
Documentation: tls_offload: fix typos and grammar
Fix spelling mistake
ipmr: fix tables suspicious RCU usage
ip6mr: fix tables suspicious RCU usage
ipmr: add debug check for mr table cleanup
selftests: rds: move test.py to TEST_FILES
net_sched: sch_fq: don't follow the fast path if Tx is behind now
tcp: Fix use-after-free of nreq in reqsk_timer_handler().
net: phy: fix phy_ethtool_set_eee() incorrectly enabling LPI
net: Comment copy_from_sockptr() explaining its behaviour
rxrpc: Improve setsockopt() handling of malformed user input
llc: Improve setsockopt() handling of malformed user input
Bluetooth: SCO: remove the redundant sco_conn_put
Bluetooth: MGMT: Fix possible deadlocks
Bluetooth: MGMT: Fix slab-use-after-free Read in set_powered_sync
bnxt_en: Unregister PTP during PCI shutdown and suspend
bnxt_en: Refactor bnxt_ptp_init()
bnxt_en: Fix receive ring space parameters when XDP is active
bnxt_en: Fix queue start to update vnic RSS table
...

show more ...


# 9bb88c65 19-Nov-2024 Jakub Kicinski <kuba@kernel.org>

selftests: net: test extacks in netlink dumps

Test that extacks in dumps work. The test fills up the receive buffer
to test both the inline dump (as part of sendmsg()) and delayed one
(run during re

selftests: net: test extacks in netlink dumps

Test that extacks in dumps work. The test fills up the receive buffer
to test both the inline dump (as part of sendmsg()) and delayed one
(run during recvmsg()).

Use YNL helpers to parse the messages. We need to add the test to YNL
file to make sure the right include path are used.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20241119224432.1713040-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# cf87766d 26-Nov-2024 Christian Brauner <brauner@kernel.org>

Merge branch 'ovl.fixes'

Bring in an overlayfs fix for v6.13-rc1 that fixes a bug introduced by
the overlayfs changes merged for v6.13.

Signed-off-by: Christian Brauner <brauner@kernel.org>


Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4
# 77b67945 14-Oct-2024 Namhyung Kim <namhyung@kernel.org>

Merge tag 'v6.12-rc3' into perf-tools-next

To get the fixes in the current perf-tools tree.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>


Revision tags: v6.12-rc3, v6.12-rc2
# 3fd6c590 30-Sep-2024 Jerome Brunet <jbrunet@baylibre.com>

Merge tag 'v6.12-rc1' into clk-meson-next

Linux 6.12-rc1


# fcc79e17 21-Nov-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
"The most significant set of changes is the per netns RTNL. The

Merge tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
"The most significant set of changes is the per netns RTNL. The new
behavior is disabled by default, regression risk should be contained.

Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its
default value from PTP_1588_CLOCK_KVM, as the first is intended to be
a more reliable replacement for the latter.

Core:

- Started a very large, in-progress, effort to make the RTNL lock
scope per network-namespace, thus reducing the lock contention
significantly in the containerized use-case, comprising:
- RCU-ified some relevant slices of the FIB control path
- introduce basic per netns locking helpers
- namespacified the IPv4 address hash table
- remove rtnl_register{,_module}() in favour of
rtnl_register_many()
- refactor rtnl_{new,del,set}link() moving as much validation as
possible out of RTNL lock
- convert all phonet doit() and dumpit() handlers to RCU
- convert IPv4 addresses manipulation to per-netns RTNL
- convert virtual interface creation to per-netns RTNL
the per-netns lock infrastructure is guarded by the
CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim.

- Introduce NAPI suspension, to efficiently switching between busy
polling (NAPI processing suspended) and normal processing.

- Migrate the IPv4 routing input, output and control path from direct
ToS usage to DSCP macros. This is a work in progress to make ECN
handling consistent and reliable.

- Add drop reasons support to the IPv4 rotue input path, allowing
better introspection in case of packets drop.

- Make FIB seqnum lockless, dropping RTNL protection for read access.

- Make inet{,v6} addresses hashing less predicable.

- Allow providing timestamp OPT_ID via cmsg, to correlate TX packets
and timestamps

Things we sprinkled into general kernel code:

- Add small file operations for debugfs, to reduce the struct ops
size.

- Refactoring and optimization for the implementation of page_frag
API, This is a preparatory work to consolidate the page_frag
implementation.

Netfilter:

- Optimize set element transactions to reduce memory consumption

- Extended netlink error reporting for attribute parser failure.

- Make legacy xtables configs user selectable, giving users the
option to configure iptables without enabling any other config.

- Address a lot of false-positive RCU issues, pointed by recent CI
improvements.

BPF:

- Put xsk sockets on a struct diet and add various cleanups. Overall,
this helps to bump performance by 12% for some workloads.

- Extend BPF selftests to increase coverage of XDP features in
combination with BPF cpumap.

- Optimize and homogenize bpf_csum_diff helper for all archs and also
add a batch of new BPF selftests for it.

- Extend netkit with an option to delegate skb->{mark,priority}
scrubbing to its BPF program.

- Make the bpf_get_netns_cookie() helper available also to tc(x) BPF
programs.

Protocols:

- Introduces 4-tuple hash for connected udp sockets, speeding-up
significantly connected sockets lookup.

- Add a fastpath for some TCP timers that usually expires after
close, the socket lock contention.

- Add inbound and outbound xfrm state caches to speed up state
lookups.

- Avoid sending MPTCP advertisements on stale subflows, reducing
risks on loosing them.

- Make neighbours table flushing more scalable, maintaining per
device neigh lists.

Driver API:

- Introduce a unified interface to configure transmission H/W
shaping, and expose it to user-space via generic-netlink.

- Add support for per-NAPI config via netlink. This makes napi
configuration persistent across queues removal and re-creation.
Requires driver updates, currently supported drivers are:
nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice.

- Add ethtool support for writing SFP / PHY firmware blocks.

- Track RSS context allocation from ethtool core.

- Implement support for mirroring to DSA CPU port, via TC mirror
offload.

- Consolidate FDB updates notification, to avoid duplicates on
device-specific entries.

- Expose DPLL clock quality level to the user-space.

- Support master-slave PHY config via device tree.

Tests and tooling:

- forwarding: introduce deferred commands, to simplify the cleanup
phase

Drivers:

- Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic,
Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the
IRQs and queues to NAPI IDs, allowing busy polling and better
introspection.

- Ethernet high-speed NICs:
- nVidia/Mellanox:
- mlx5:
- a large refactor to implement support for cross E-Switch
scheduling
- refactor H/W conter management to let it scale better
- H/W GRO cleanups
- Intel (100G, ice)::
- add support for ethtool reset
- implement support for per TX queue H/W shaping
- AMD/Solarflare:
- implement per device queue stats support
- Broadcom (bnxt):
- improve wildcard l4proto on IPv4/IPv6 ntuple rules
- Marvell Octeon:
- Add representor support for each Resource Virtualization Unit
(RVU) device.
- Hisilicon:
- add support for the BMC Gigabit Ethernet
- IBM (EMAC):
- driver cleanup and modernization
- Cisco (VIC):
- raise the queues number limit to 256

- Ethernet virtual:
- Google vNIC:
- implement page pool support
- macsec:
- inherit lower device's features and TSO limits when
offloading
- virtio_net:
- enable premapped mode by default
- support for XDP socket(AF_XDP) zerocopy TX
- wireguard:
- set the TSO max size to be GSO_MAX_SIZE, to aggregate larger
packets.

- Ethernet NICs embedded and virtual:
- Broadcom ASP:
- enable software timestamping
- Freescale:
- add enetc4 PF driver
- MediaTek: Airoha SoC:
- implement BQL support
- RealTek r8169:
- enable TSO by default on r8168/r8125
- implement extended ethtool stats
- Renesas AVB:
- enable TX checksum offload
- Synopsys (stmmac):
- support header splitting for vlan tagged packets
- move common code for DWMAC4 and DWXGMAC into a separate FPE
module.
- add dwmac driver support for T-HEAD TH1520 SoC
- Synopsys (xpcs):
- driver refactor and cleanup
- TI:
- icssg_prueth: add VLAN offload support
- Xilinx emaclite:
- add clock support

- Ethernet switches:
- Microchip:
- implement support for the lan969x Ethernet switch family
- add LAN9646 switch support to KSZ DSA driver

- Ethernet PHYs:
- Marvel: 88q2x: enable auto negotiation
- Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2

- PTP:
- Add support for the Amazon virtual clock device
- Add PtP driver for s390 clocks

- WiFi:
- mac80211
- EHT 1024 aggregation size for transmissions
- new operation to indicate that a new interface is to be added
- support radio separation of multi-band devices
- move wireless extension spy implementation to libiw
- Broadcom:
- brcmfmac: optional LPO clock support
- Microchip:
- add support for Atmel WILC3000
- Qualcomm (ath12k):
- firmware coredump collection support
- add debugfs support for a multitude of statistics
- Qualcomm (ath5k):
- Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
- Realtek:
- rtw88: 8821au and 8812au USB adapters support
- rtw89: add thermal protection
- rtw89: fine tune BT-coexsitence to improve user experience
- rtw89: firmware secure boot for WiFi 6 chip

- Bluetooth
- add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and
0x13d3:0x3623
- add Realtek RTL8852BE support for id Foxconn 0xe123
- add MediaTek MT7920 support for wireless module ids
- btintel_pcie: add handshake between driver and firmware
- btintel_pcie: add recovery mechanism
- btnxpuart: add GPIO support to power save feature"

* tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits)
mm: page_frag: fix a compile error when kernel is not compiled
Documentation: tipc: fix formatting issue in tipc.rst
selftests: nic_performance: Add selftest for performance of NIC driver
selftests: nic_link_layer: Add selftest case for speed and duplex states
selftests: nic_link_layer: Add link layer selftest for NIC driver
bnxt_en: Add FW trace coredump segments to the coredump
bnxt_en: Add a new ethtool -W dump flag
bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr()
bnxt_en: Add functions to copy host context memory
bnxt_en: Do not free FW log context memory
bnxt_en: Manage the FW trace context memory
bnxt_en: Allocate backing store memory for FW trace logs
bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem()
bnxt_en: Refactor bnxt_free_ctx_mem()
bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type
bnxt_en: Update firmware interface spec to 1.10.3.85
selftests/bpf: Add some tests with sockmap SK_PASS
bpf: fix recursive lock when verdict program return SK_PASS
wireguard: device: support big tcp GSO
wireguard: selftests: load nf_conntrack if not present
...

show more ...


# e709d442 16-Nov-2024 Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ndo_fdb_add-del-have-drivers-report-whether-they-notified'

Petr Machata says:

====================
net: ndo_fdb_add/del: Have drivers report whether they notified

Currently when

Merge branch 'net-ndo_fdb_add-del-have-drivers-report-whether-they-notified'

Petr Machata says:

====================
net: ndo_fdb_add/del: Have drivers report whether they notified

Currently when FDB entries are added to or deleted from a VXLAN netdevice,
the VXLAN driver emits one notification, including the VXLAN-specific
attributes. The core however always sends a notification as well, a generic
one. Thus two notifications are unnecessarily sent for these operations. A
similar situation comes up with bridge driver, which also emits
notifications on its own.

# ip link add name vx type vxlan id 1000 dstport 4789
# bridge monitor fdb &
[1] 1981693
# bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1
de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent
de:ad:be:ef:13:37 dev vx self permanent

In order to prevent this duplicity, add a parameter, bool *notified, to
ndo_fdb_add and ndo_fdb_del. The flag is primed to false, and if the callee
sends a notification on its own, it sets the flag to true, thus informing
the core that it should not generate another notification.

Patches #1 to #2 are concerned with the above.

In the remaining patches, #3 to #7, add a selftest. This takes place across
several patches. Many of the helpers we would like to use for the test are
in forwarding/lib.sh, whereas net/ is a more suitable place for the test,
so the libraries need to be massaged a bit first.
====================

Link: https://patch.msgid.link/cover.1731589511.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 15880bec 14-Nov-2024 Petr Machata <petrm@nvidia.com>

selftests: net: fdb_notify: Add a test for FDB notifications

Check that only one notification is produced for various FDB edit
operations.

Regarding the ip_link_add() and ip_link_master() helpers.

selftests: net: fdb_notify: Add a test for FDB notifications

Check that only one notification is produced for various FDB edit
operations.

Regarding the ip_link_add() and ip_link_master() helpers. This pattern of
action plus corresponding defer is bound to come up often, and a dedicated
vocabulary to capture it will be handy. tunnel_create() and vlan_create()
from forwarding/lib.sh are somewhat opaque and perhaps too kitchen-sinky,
so I tried to go in the opposite direction with these ones, and wrapped
only the bare minimum to schedule a corresponding cleanup.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://patch.msgid.link/910c5880ae6d3b558d6889cbdba2be690c2615c6.1731589511.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# a79993b5 14-Nov-2024 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.12-rc8).

Conflicts:

tools/testing/selftests/net/.gitignore
252e01e68241 ("s

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.12-rc8).

Conflicts:

tools/testing/selftests/net/.gitignore
252e01e68241 ("selftests: net: add netlink-dumps to .gitignore")
be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw")
https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/

drivers/net/phy/phylink.c
671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled")
7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"")

Adjacent changes:

drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c
5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines")
e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 80b6f094 12-Nov-2024 Jakub Kicinski <kuba@kernel.org>

Merge branch 'suspend-irqs-during-application-busy-periods'

Joe Damato says:

====================
Suspend IRQs during application busy periods

This series introduces a new mechanism, IRQ suspensio

Merge branch 'suspend-irqs-during-application-busy-periods'

Joe Damato says:

====================
Suspend IRQs during application busy periods

This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.

Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.

I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].

~ The short explanation (TL;DR)

We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.

If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.

If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).

The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.

For a more in depth explanation, please continue reading.

~ Comparison with existing mechanisms

Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:

At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.

At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.

While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.

An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.

We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.

This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.

~ Usage details

IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.

Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.

- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.

- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.

- The user application then calls epoll_wait to busy poll for network
events, as it normally would.

- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.

- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.

As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.

Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.

~ Usage scenario

The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).

gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.

irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.

~ Important call out in the implementation

- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.

~ Benchmark configs & descriptions

The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].

To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.

Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):

- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)

~ Benchmark results

Tested on:

Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC

The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.

Results:

Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.

The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.

These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.

The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.

Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).

The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).

Note: we've reorganized the results to make comparison among testcases
with the same load easier.

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078

testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955

~ FAQ

- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?

Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.

There are essentially three possible loops for network processing and
packet delivery:

1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping

Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.

If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.

If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.

Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.

This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.

- Can the new timeout value be threaded through the new epoll ioctl ?

It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.

An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.

Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.

We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.

- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?

No. The problem is that the long timeout must engage if and only if
prefer-busy is active.

When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.

Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.

- Isn't it already possible to combine busy looping with irq deferral?

Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.

~ Special thanks

Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:

- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.

- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.

Thanks,
Martin and Joe

[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/libevent.patch
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results

v8: https://lore.kernel.org/20241108045337.292905-1-jdamato@fastly.com
v7: https://lore.kernel.org/20241108023912.98416-1-jdamato@fastly.com
v6: https://lore.kernel.org/20241104215542.215919-1-jdamato@fastly.com
v5: https://lore.kernel.org/20241103052421.518856-1-jdamato@fastly.com
v4: https://lore.kernel.org/20241102005214.32443-1-jdamato@fastly.com
v3: https://lore.kernel.org/20241101004846.32532-1-jdamato@fastly.com
v2: https://lore.kernel.org/20241021015311.95468-1-jdamato@fastly.com
====================

Link: https://patch.msgid.link/20241109050245.191288-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 347fcdc4 09-Nov-2024 Joe Damato <jdamato@fastly.com>

selftests: net: Add busy_poll_test

Add an epoll busy poll test using netdevsim.

This test is comprised of:
- busy_poller (via busy_poller.c)
- busy_poll_test.sh which loads netdevsim, sets up n

selftests: net: Add busy_poll_test

Add an epoll busy poll test using netdevsim.

This test is comprised of:
- busy_poller (via busy_poller.c)
- busy_poll_test.sh which loads netdevsim, sets up network namespaces,
and runs busy_poller to receive data and socat to send data.

The selftest tests two different scenarios:
- busy poll (the pre-existing version in the kernel)
- busy poll with suspend enabled (what this series adds)

The data transmit is a 1MiB temporary file generated from /dev/urandom
and the test is considered passing if the md5sum of the input file to
socat matches the md5sum of the output file from busy_poller.

netdevsim was chosen instead of veth due to netdevsim's support for
netdev-genl.

For now, this test uses the functionality that netdevsim provides. In the
future, perhaps netdevsim can be extended to emulate device IRQs to more
thoroughly test all pre-existing kernel options (like defer_hard_irqs)
and suspend.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241109050245.191288-6-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


12345678910>>...45