net: switchdev: Convert blocking notification chain to a raw oneA blocking notification chain uses a read-write semaphore to protect theintegrity of the chain. The semaphore is acquired for writin
net: switchdev: Convert blocking notification chain to a raw oneA blocking notification chain uses a read-write semaphore to protect theintegrity of the chain. The semaphore is acquired for writing whenadding / removing notifiers to / from the chain and acquired for readingwhen traversing the chain and informing notifiers about an event.In case of the blocking switchdev notification chain, recursivenotifications are possible which leads to the semaphore being acquiredtwice for reading and to lockdep warnings being generated [1].Specifically, this can happen when the bridge driver processes aSWITCHDEV_BRPORT_UNOFFLOADED event which causes it to emit notificationsabout deferred events when calling switchdev_deferred_process().Fix this by converting the notification chain to a raw notificationchain in a similar fashion to the netdev notification chain. Protectthe chain using the RTNL mutex by acquiring it when modifying the chain.Events are always informed under the RTNL mutex, but add an assertion incall_switchdev_blocking_notifiers() to make sure this is not violated inthe future.Maintain the "blocking" prefix as events are always emitted from processcontext and listeners are allowed to block.[1]:WARNING: possible recursive locking detected6.14.0-rc4-custom-g079270089484 #1 Not tainted--------------------------------------------ip/52731 is trying to acquire lock:ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0but task is already holding lock:ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0other info that might help us debug this:Possible unsafe locking scenario:CPU0----lock((switchdev_blocking_notif_chain).rwsem);lock((switchdev_blocking_notif_chain).rwsem);*** DEADLOCK ***May be due to missing lock nesting notation3 locks held by ip/52731: #0: ffffffff84f795b0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x727/0x1dc0 #1: ffffffff8731f628 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x790/0x1dc0 #2: ffffffff850918d8 ((switchdev_blocking_notif_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x58/0xa0stack backtrace:...? __pfx_down_read+0x10/0x10? __pfx_mark_lock+0x10/0x10? __pfx_switchdev_port_attr_set_deferred+0x10/0x10blocking_notifier_call_chain+0x58/0xa0switchdev_port_attr_notify.constprop.0+0xb3/0x1b0? __pfx_switchdev_port_attr_notify.constprop.0+0x10/0x10? mark_held_locks+0x94/0xe0? switchdev_deferred_process+0x11a/0x340switchdev_port_attr_set_deferred+0x27/0xd0switchdev_deferred_process+0x164/0x340br_switchdev_port_unoffload+0xc8/0x100 [bridge]br_switchdev_blocking_event+0x29f/0x580 [bridge]notifier_call_chain+0xa2/0x440blocking_notifier_call_chain+0x6e/0xa0switchdev_bridge_port_unoffload+0xde/0x1a0...Fixes: f7a70d650b0b6 ("net: bridge: switchdev: Ensure deferred event delivery on unoffload")Signed-off-by: Amit Cohen <amcohen@nvidia.com>Reviewed-by: Ido Schimmel <idosch@nvidia.com>Reviewed-by: Simon Horman <horms@kernel.org>Reviewed-by: Vladimir Oltean <olteanv@gmail.com>Tested-by: Vladimir Oltean <olteanv@gmail.com>Link: https://patch.msgid.link/20250305121509.631207-1-amcohen@nvidia.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
net: bridge: switchdev: Improve error message for port_obj_add/del functionsEnhance the error reporting mechanism in the switchdev framework toprovide more informative and user-friendly error mess
net: bridge: switchdev: Improve error message for port_obj_add/del functionsEnhance the error reporting mechanism in the switchdev framework toprovide more informative and user-friendly error messages.Following feedback from users struggling to understand the implicationsof error messages like "failed (err=-28) to add object (id=2)", thisupdate aims to clarify what operation failed and how this might impactthe system or network.With this change, error messages now include a description of the failedoperation, the specific object involved, and a brief explanation of thepotential impact on the system. This approach helps administrators anddevelopers better understand the context and severity of errors,facilitating quicker and more effective troubleshooting.Example of the improved logging:[ 70.516446] ksz-switch spi0.0 uplink: Failed to add Port Multicast Database entry (object id=2) with error: -ENOSPC (-28).[ 70.516446] Failure in updating the port's Multicast Database could lead to multicast forwarding issues.[ 70.516446] Current HW/SW setup lacks sufficient resources.This comprehensive update includes handling for a range of switchdevobject IDs, ensuring that most operations within the switchdev frameworkbenefit from clearer error reporting.Reviewed-by: Simon Horman <horms@kernel.org>Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: switchdev: Skip MDB replays of deferred events on offloadBefore this change, generation of the list of MDB events to replaywould race against the creation of new group memberships, ei
net: bridge: switchdev: Skip MDB replays of deferred events on offloadBefore this change, generation of the list of MDB events to replaywould race against the creation of new group memberships, either fromthe IGMP/MLD snooping logic or from user configuration.While new memberships are immediately visible to walkers ofbr->mdb_list, the notification of their existence to switchdev eventsubscribers is deferred until a later point in time. So if a replaylist was generated during a time that overlapped with such a window,it would also contain a replay of the not-yet-delivered event.The driver would thus receive two copies of what the bridge internallyconsidered to be one single event. On destruction of the bridge, onlya single membership deletion event was therefore sent. As aconsequence of this, drivers which reference count memberships (atleast DSA), would be left with orphan groups in their hardwaredatabase when the bridge was destroyed.This is only an issue when replaying additions. While deletion eventsmay still be pending on the deferred queue, they will already havebeen removed from br->mdb_list, so no duplicates can be generated inthat scenario.To a user this meant that old group memberships, from a bridge inwhich a port was previously attached, could be reanimated (inhardware) when the port joined a new bridge, without the new bridge'sknowledge.For example, on an mv88e6xxx system, create a snooping bridge andimmediately add a port to it: root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ > ip link set dev x3 up master br0And then destroy the bridge: root@infix-06-0b-00:~$ ip link del dev br0 root@infix-06-0b-00:~$ mvls atu ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a DEV:0 Marvell 88E6393X 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a root@infix-06-0b-00:~$The two IPv6 groups remain in the hardware database because theport (x3) is notified of the host's membership twice: once via theoriginal event and once via a replay. Since only a single deletenotification is sent, the count remains at 1 when the bridge isdestroyed.Then add the same port (or another port belonging to the same hardwaredomain) to a new bridge, this time with snooping disabled: root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \ > ip link set dev x3 up master br1All multicast, including the two IPv6 groups from br0, should now beflooded, according to the policy of br1. But instead the oldmemberships are still active in the hardware database, causing theswitch to only forward traffic to those groups towards the CPU (port0).Eliminate the race in two steps:1. Grab the write-side lock of the MDB while generating the replay list.This prevents new memberships from showing up while we are generatingthe replay list. But it leaves the scenario in which a deferred eventwas already generated, but not delivered, before we grabbed thelock. Therefore:2. Make sure that no deferred version of a replay event is already enqueued to the switchdev deferred queue, before adding it to the replay list, when replaying additions.Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>Reviewed-by: Vladimir Oltean <olteanv@gmail.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: Add a helper to replay objects on a bridge portWhen a front panel joins a bridge via another netdevice (typically a LAG),the driver needs to learn about the objects configured on t
net: switchdev: Add a helper to replay objects on a bridge portWhen a front panel joins a bridge via another netdevice (typically a LAG),the driver needs to learn about the objects configured on the bridge port.When the bridge port is offloaded by the driver for the first time, thiscan be achieved by passing a notifier to switchdev_bridge_port_offload().The notifier is then invoked for the individual objects (such as VLANs)configured on the bridge, and can look for the interesting ones.Calling switchdev_bridge_port_offload() when the second port joins thebridge lower is unnecessary, but the replay is still needed. To that end,add a new function, switchdev_bridge_port_replay(), which does only thereplay part of the _offload() function in exactly the same way as thatfunction.Cc: Jiri Pirko <jiri@resnulli.us>Cc: Ivan Vecera <ivecera@redhat.com>Cc: Roopa Prabhu <roopa@nvidia.com>Cc: Nikolay Aleksandrov <razor@blackwall.org>Cc: bridge@lists.linux-foundation.orgSigned-off-by: Petr Machata <petrm@nvidia.com>Reviewed-by: Danielle Ratson <danieller@nvidia.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: rename reference+tracking helpersNetdev reference helpers have a dev_ prefix for historicreasons. Renaming the old helpers would be too much churnbut we can rename the tracking ones which ar
net: rename reference+tracking helpersNetdev reference helpers have a dev_ prefix for historicreasons. Renaming the old helpers would be too much churnbut we can rename the tracking ones which are relativelyrecent and should be the default for new code.Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace()Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_deviceWhen the switchdev_handle_fdb_event_to_device() event replication helperwas created, my original thought was that FDB eve
net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_deviceWhen the switchdev_handle_fdb_event_to_device() event replication helperwas created, my original thought was that FDB events on LAG interfacesshould most likely be special-cased, not just replicated towards allswitchdev ports beneath that LAG. So this replication helper currentlydoes not recurse through switchdev lower interfaces of LAG bridge ports,but rather calls the lag_mod_cb() if that was provided.No switchdev driver uses this helper for FDB events on LAG interfacesyet, so that was an assumption which was yet to be tested. It iscertainly usable for that purpose, as my RFC series shows:https://patchwork.kernel.org/project/netdevbpf/cover/20220210125201.2859463-1-vladimir.oltean@nxp.com/however this approach is slightly convoluted because:- the switchdev driver gets a "dev" that isn't its own net device, but rather the LAG net device. It must call switchdev_lower_dev_find(dev) in order to get a handle of any of its own net devices (the ones that pass check_cb).- in order for FDB entries on LAG ports to be correctly refcounted per the number of switchdev ports beneath that LAG, we haven't escaped the need to iterate through the LAG's lower interfaces. Except that is now the responsibility of the switchdev driver, because the replication helper just stopped half-way.So, even though yes, FDB events on LAG bridge ports must bespecial-cased, in the end it's simpler to let switchdev_handle_fdb_*just iterate through the LAG port's switchdev lowers, and let theswitchdev driver figure out that those physical ports are under a LAG.The switchdev_handle_fdb_event_to_device() helper takes a"foreign_dev_check" callback so it can figure out whether @dev canautonomously forward to @foreign_dev. DSA fills this method properly:if the LAG is offloaded by another port in the same tree as @dev, thenit isn't foreign. If it is a software LAG, it is foreign - forwardinghappens in software.Whether an interface is foreign or not decides whether the replicationhelper will go through the LAG's switchdev lowers or not. Since thelan966x doesn't properly fill this out, FDB events on software LAGuppers will get called. By changing lan966x_foreign_dev_check(), we cansuppress them.Whereas DSA will now start receiving FDB events for its offloaded LAGuppers, so we need to return -EOPNOTSUPP, since we currently don't dothe right thing for them.Cc: Horatiu Vultur <horatiu.vultur@microchip.com>Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: avoid infinite recursion from LAG to bridge with port object handlerThe logic from switchdev_handle_port_obj_add_foreign() is directlyadapted from switchdev_handle_fdb_event_to_dev
net: switchdev: avoid infinite recursion from LAG to bridge with port object handlerThe logic from switchdev_handle_port_obj_add_foreign() is directlyadapted from switchdev_handle_fdb_event_to_device(), which alreadydetects events on foreign interfaces and reoffloads them towards theswitchdev neighbors.However, when we have a simple br0 <-> bond0 <-> swp0 topology and theswitchdev_handle_port_obj_add_foreign() gets called on bond0, we getstuck into an infinite recursion:1. bond0 does not pass check_cb(), so we attempt to find switchdev neighbor interfaces. For that, we recursively call __switchdev_handle_port_obj_add() for bond0's bridge, br0.2. __switchdev_handle_port_obj_add() recurses through br0's lowers, essentially calling __switchdev_handle_port_obj_add() for bond03. Go to step 1.This happens because switchdev_handle_fdb_event_to_device() andswitchdev_handle_port_obj_add_foreign() are not exactly the same.The FDB event helper special-cases LAG interfaces with its lag_mod_cb(),so this is why we don't end up in an infinite loop - because it doesn'tattempt to treat LAG interfaces as potentially foreign bridge ports.The problem is solved by looking ahead through the bridge's lowers tosee whether there is any switchdev interface that is foreign to the @devwe are currently processing. This stops the recursion described above atstep 1: __switchdev_handle_port_obj_add(bond0) will not create anothercall to __switchdev_handle_port_obj_add(br0). Going one step uppershould only happen when we're starting from a bridge port that has beendetermined to be "foreign" to the switchdev driver that passes theforeign_dev_check_cb().Fixes: c4076cdd21f8 ("net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign interfaces")Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign interfacesThe switchdev_handle_port_obj_add() helper is good for replicating aport object on the lower interfaces of @dev,
net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign interfacesThe switchdev_handle_port_obj_add() helper is good for replicating aport object on the lower interfaces of @dev, if that object was emittedon a bridge, or on a bridge port that is a LAG.However, drivers that use this helper limit themselves to a box fromwhich they can no longer intercept port objects notified on neighborports ("foreign interfaces").One such driver is DSA, where software bridging with foreign interfacessuch as standalone NICs or Wi-Fi APs is an important use case. There, aVLAN installed on a neighbor bridge port roughly corresponds to aforwarding VLAN installed on the DSA switch's CPU port.To support this use case while also making use of the benefits of theswitchdev_handle_* replication helper for port objects, introduce a newvariant of these functions that crawls through the neighbor ports of@dev, in search of potentially compatible switchdev ports that areinterested in the event.The strategy is identical to switchdev_handle_fdb_event_to_device():if @dev wasn't a switchdev interface, then go one step upper, andrecursively call this function on the bridge that this port belongs to.At the next recursion step, __switchdev_handle_port_obj_add() williterate through the bridge's lower interfaces. Among those, some will beswitchdev interfaces, and one will be the original @dev that we camefrom. To prevent infinite recursion, we must suppress reentry into theoriginal @dev, and just call the @add_cb for the switchdev_interfaces.It looks like this: br0 / | \ / | \ / | \ swp0 swp1 eth01. __switchdev_handle_port_obj_add(eth0) -> check_cb(eth0) returns false -> eth0 has no lower interfaces -> eth0's bridge is br0 -> switchdev_lower_dev_find(br0, check_cb, foreign_dev_check_cb)) finds br02. __switchdev_handle_port_obj_add(br0) -> check_cb(br0) returns false -> netdev_for_each_lower_dev -> check_cb(swp0) returns true, so we don't skip this interface3. __switchdev_handle_port_obj_add(swp0) -> check_cb(swp0) returns true, so we call add_cb(swp0)(back to netdev_for_each_lower_dev from 2) -> check_cb(swp1) returns true, so we don't skip this interface4. __switchdev_handle_port_obj_add(swp1) -> check_cb(swp1) returns true, so we call add_cb(swp1)(back to netdev_for_each_lower_dev from 2) -> check_cb(eth0) returns false, so we skip this interface to avoid infinite recursionNote: eth0 could have been a LAG, and we don't want to suppress therecursion through its lowers if those exist, so when check_cb() returnsfalse, we still call switchdev_lower_dev_find() to estimate whetherthere's anything worth a recursion beneath that LAG. Using check_cb()and foreign_dev_check_cb(), switchdev_lower_dev_find() not only figuresout whether the lowers of the LAG are switchdev, but also whether theyactively offload the LAG or not (whether the LAG is "foreign" to theswitchdev interface or not).The port_obj_info->orig_dev is preserved across recursive calls, soswitchdev drivers still know on which device was this notificationoriginally emitted.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: rename switchdev_lower_dev_find to switchdev_lower_dev_find_rcuswitchdev_lower_dev_find() assumes RCU read-side critical sectioncalling context, since it uses netdev_walk_all_lower
net: switchdev: rename switchdev_lower_dev_find to switchdev_lower_dev_find_rcuswitchdev_lower_dev_find() assumes RCU read-side critical sectioncalling context, since it uses netdev_walk_all_lower_dev_rcu().Rename it appropriately, in preparation of adding a similar iteratorthat assumes writer-side rtnl_mutex protection.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net/switchdev: use struct_size over open coded arithmeticReplace zero-length array with flexible-array member and make useof the struct_size() helper in kmalloc(). For example:struct switchdev_d
net/switchdev: use struct_size over open coded arithmeticReplace zero-length array with flexible-array member and make useof the struct_size() helper in kmalloc(). For example:struct switchdev_deferred_item { ... unsigned long data[];};Make use of the struct_size() helper instead of an open-coded versionin order to avoid any potential type mistakes.Reported-by: Zeal Robot <zealci@zte.com.cn>Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: add net device refcount trackerSigned-off-by: Eric Dumazet <edumazet@google.com>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: merge switchdev_handle_fdb_{add,del}_to_deviceTo reduce code churn, the same patch makes multiple changes, since theyall touch the same lines:1. The implementations for these two
net: switchdev: merge switchdev_handle_fdb_{add,del}_to_deviceTo reduce code churn, the same patch makes multiple changes, since theyall touch the same lines:1. The implementations for these two are identical, just with different function pointers. Reduce duplications and name the function pointers "mod_cb" instead of "add_cb" and "del_cb". Pass the event as argument.2. Drop the "const" attribute from "orig_dev". If the driver needs to check whether orig_dev belongs to itself and then call_switchdev_notifiers(orig_dev, SWITCHDEV_FDB_OFFLOADED), it can't, because call_switchdev_notifiers takes a non-const struct net_device *.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Ido Schimmel <idosch@nvidia.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridgeWith the introduction of explicit offloading API in switchdev in commit2f5dc00f7a3e ("net: bridge: switchdev: let driver
net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridgeWith the introduction of explicit offloading API in switchdev in commit2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridgeports are offloaded"), we started having Ethernet switch drivers callingdirectly into a function exported by net/bridge/br_switchdev.c, which isa function exported by the bridge driver.This means that drivers that did not have an explicit dependency on thebridge before, like cpsw and am65-cpsw, now do - otherwise it is notpossible to call a symbol exported by a driver that can be built asmodule unless you are a module too.There was an attempt to solve the dependency issue in the form of commitb0e81817629a ("net: build all switchdev drivers as modules when thebridge is a module"). Grygorii Strashko, however, says about it:| In my opinion, the problem is a bit bigger here than just fixing the| build :(|| In case, of ^cpsw the switchdev mode is kinda optional and in many| cases (especially for testing purposes, NFS) the multi-mac mode is| still preferable mode.|| There were no such tight dependency between switchdev drivers and| bridge core before and switchdev serviced as independent, notification| based layer between them, so ^cpsw still can be "Y" and bridge can be| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE| will need to be set as "Y", or we will have to update drivers to| support build with BRIDGE=n and maintain separate builds for| networking vs non-networking testing. But is this enough? Wouldn't| it cause 'chain reaction' required to add more and more "Y" options| (like CONFIG_VLAN_8021Q)?|| PS. Just to be sure we on the same page - ARM builds will be forced| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our| automation testing will just fail with omap2plus_defconfig.In the light of this, it would be desirable for some configurations toavoid dependencies between switchdev drivers and the bridge, and havethe switchdev mode as completely optional within the driver.Arnd Bergmann also tried to write a patch which better expressed thebuild time dependency for Ethernet switch drivers where the switchdevsupport is optional, like cpsw/am65-cpsw, and this made the driversfollow the bridge (compile as module if the bridge is a module) only ifthe optional switchdev support in the driver was enabled in the firstplace:https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/but this still did not solve the fact that cpsw and am65-cpsw now mustbe built as modules when the bridge is a module - it just expressedcorrectly that optional dependency. But the new behavior is an apparentregression from Grygorii's perspective.So to support the use case where the Ethernet driver is built-in,NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, weneed a framework that can handle the possible absence of the bridge fromthe running system, i.e. runtime bloatware as opposed to build-timebloatware.Luckily we already have this framework, since switchdev has been usingit extensively. Events from the bridge side are transmitted to thedriver side using notifier chains - this was originally done so thatunrelated drivers could snoop for events emitted by the bridge towardsports that are implemented by other drivers (think of a switch driverwith LAG offload that listens for switchdev events on a bonding/teaminterface that it offloads).There are also events which are transmitted from the driver side to thebridge side, which again are modeled using notifiers.SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals withnotifying the bridge that a MAC address has been dynamically learned.So there is a precedent we can use for modeling the new framework.The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the workthat the bridge needs to do when a port becomes offloaded is blocking inits nature: replay VLANs, MDBs etc. The calling context is indeedblocking (we are under rtnl_mutex), but the existing switchdevnotification chain that the bridge is subscribed to is only the atomicone. So we need to subscribe the bridge to the blocking switchdevnotification chain too.This patch:- keeps the driver-side perception of the switchdev_bridge_port_{,un}offload unchanged- moves the implementation of switchdev_bridge_port_{,un}offload from the bridge module into the switchdev module.- makes everybody that is subscribed to the switchdev blocking notifier chain "hear" offload & unoffload events- makes the bridge driver subscribe and handle those events- moves the bridge driver's handling of those events into 2 new functions called br_switchdev_port_{,un}offload. These functions contain in fact the core of the logic that was previously in switchdev_bridge_port_{,un}offload, just that now we go through an extra indirection layer to reach them.Unlike all the other switchdev notification structures, the structureused to carry the bridge port information, structswitchdev_notifier_brport_info, does not contain a "bool handled".This is because in the current usage pattern, we always know that aswitchdev bridge port offloading event will be handled by the bridge,because the switchdev_bridge_port_offload() call was initiated by aNETDEV_CHANGEUPPER event in the first place, where info->upper_dev is abridge. So if the bridge wasn't loaded, then the CHANGEUPPER eventcouldn't have happened.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: fix FDB entries towards foreign ports not getting propagated to usThe newly introduced switchdev_handle_fdb_{add,del}_to_device helperssolved a problem but introduced another one.
net: switchdev: fix FDB entries towards foreign ports not getting propagated to usThe newly introduced switchdev_handle_fdb_{add,del}_to_device helperssolved a problem but introduced another one. They have a severe designbug: they do not propagate FDB events on foreign interfaces to us, i.e.this use case: br0 / \ / \ / \ / \ swp0 eno0(switchdev) (foreign)when an address is learned on eno0, what is supposed to happen is thatthis event should also be propagated towards swp0. Somehow I managed toconvince myself that this did work correctly, but obviously it does not.The trouble with foreign interfaces is that we must reach a switchdevnet_device pointer through a foreign net_device that has no directupper/lower relationship with it. So we need to do exploratory searchingthrough the lower interfaces of the foreign net_device's bridge upper(to reach swp0 from eno0, we must check its upper, br0, for lowerinterfaces that pass the check_cb and foreign_dev_check_cb). This issomething that the previous code did not do, it just assumed that "dev"will become a switchdev interface at some point, somehow, probably bymagic.With this patch, assisted address learning on the CPU port works againin DSA:ip link add br0 type bridgeip link set swp0 master br0ip link set eno0 master br0ip link set br0 up[ 46.708929] mscc_felix 0000:00:00.5 swp0: Adding FDB entry towards eno0, addr 00:04:9f:05:f4:ab vid 0 as host addressFixes: 8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")Reported-by: Eric Woudstra <ericwouds@gmail.com>Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: recurse into __switchdev_handle_fdb_del_to_deviceThe difference between __switchdev_handle_fdb_del_to_device andswitchdev_handle_del_to_device is that the former takes an extraori
net: switchdev: recurse into __switchdev_handle_fdb_del_to_deviceThe difference between __switchdev_handle_fdb_del_to_device andswitchdev_handle_del_to_device is that the former takes an extraorig_dev argument, while the latter starts with dev == orig_dev.We should recurse into the variant that does not lose the orig_dev alongthe way. This is relevant when deleting FDB entries pointing towards abridge (dev changes to the lower interfaces, but orig_dev shouldn't).The addition helper already recurses properly, just the deletion onedoesn't.Fixes: 8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICECurrently DSA has an issue with FDB entries pointing towards the bridgein the presence of br_fdb_replay() being calle
net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICECurrently DSA has an issue with FDB entries pointing towards the bridgein the presence of br_fdb_replay() being called at port join and leavetime.In particular, each bridge port will ask for a replay for the FDBentries pointing towards the bridge when it joins, and for anotherreplay when it leaves.This means that for example, a bridge with 4 switch ports will notifyDSA 4 times of the bridge MAC address.But if the MAC address of the bridge changes during the normal runtimeof the system, the bridge notifies switchdev [ once ] of the deletion ofthe old MAC address as a local FDB towards the bridge, and of theinsertion [ again once ] of the new MAC address as a local FDB.This is a problem, because DSA keeps the old MAC address as a host FDBentry with refcount 4 (4 ports asked for it using br_fdb_replay). So theold MAC address will not be deleted. Additionally, the new MAC addresswill only be installed with refcount 1, and when the first switch portleaves the bridge (leaving 3 others as still members), it will deletewith it the new MAC address of the bridge from the local FDB entrieskept by DSA (because the br_fdb_replay call on deletion will bring theentry's refcount from 1 to 0).So the problem, really, is that the number of br_fdb_replay() calls isnot matched with the refcount that a host FDB is offloaded to DSA duringnormal runtime.An elegant way to solve the problem would be to make the switchdevnotification emitted by br_fdb_change_mac_address() result in a host FDBkept by DSA which has a refcount exactly equal to the number of portsunder that bridge. Then, no matter how many DSA ports join or leave thatbridge, the host FDB entry will always be deleted when there are exactlyzero remaining DSA switch ports members of the bridge.To implement the proposed solution, we remember that the switchdevobjects and port attributes have some helpers provided by switchdev,which can be optionally called by drivers:switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.These helpers:- fan out a switchdev object/attribute emitted for the bridge towards all the lower interfaces that pass the check_cb().- fan out a switchdev object/attribute emitted for a bridge port that is a LAG towards all the lower interfaces that pass the check_cb().In other words, this is the model we need for the FDB events too:something that will keep an FDB entry emitted towards a physical port asit is, but translate an FDB entry emitted towards the bridge into N FDBentries, one per physical port.Of course, there are many differences between fanning out a switchdevobject (VLAN) on 3 lower interfaces of a LAG and fanning out an FDBentry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towardsa LAG should be treated specially, because FDB entries are unicast, wecan't just install the same address towards 3 destinations. It isimaginable that drivers might want to treat this case specifically, socreate some methods for this case and do not recurse into the LAG lowerports, just the bridge ports.DSA also listens for FDB entries on "foreign" interfaces, aka interfacesbridged with us which are not part of our hardware domain: think anEthernet switch bridged with a Wi-Fi AP. For those addresses, DSAinstalls host FDB entries. However, there we have the same problem(those host FDB entries are installed with a refcount of only 1) and aneven bigger one which we did not have with FDB entries towards thebridge:br_fdb_replay() is currently not called for FDB entries on foreigninterfaces, just for the physical port and for the bridge itself.So when DSA sniffs an address learned by the software bridge towards aforeign interface like an e1000 port, and then that e1000 leaves thebridge, DSA remains with the dangling host FDB address. That will befixed separately by replaying all FDB entries and not just the onestowards the port and the bridge.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: add a context void pointer to struct switchdev_notifier_infoIn the case where the driver asks for a replay of a certain type ofevent (port object or attribute) for a bridge port th
net: switchdev: add a context void pointer to struct switchdev_notifier_infoIn the case where the driver asks for a replay of a certain type ofevent (port object or attribute) for a bridge port that is a LAG, it maydo so because this port has just joined the LAG.But there might already be other switchdev ports in that LAG, and it ispreferable that those preexisting switchdev ports do not act upon thereplayed event.The solution is to add a context to switchdev events, which is NULL mostof the time (when the bridge layer initiates the call) but which can beset to a value controlled by the switchdev driver when a replay isrequested. The driver can then check the context to figure out if allports within the LAG should act upon the switchdev event, or just theones that match the context.We have to modify all switchdev_handle_* helper functions as well as theprototypes in the drivers that use these helpers too, because thesehelpers hide the underlying struct switchdev_notifier_info from us andthere is no way to retrieve the context otherwise.The context structure will be populated and used in later patches.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: propagate extack through switchdev_port_attr_setThe benefit is the ability to propagate errors from switchdev driversfor the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING andSWITCHDEV_ATTR
net: bridge: propagate extack through switchdev_port_attr_setThe benefit is the ability to propagate errors from switchdev driversfor the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING andSWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL attributes.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Signed-off-by: David S. Miller <davem@davemloft.net>
net: switchdev: propagate extack to port attributesWhen a struct switchdev_attr is notified through switchdev, there is noway to report informational messages, unlike for struct switchdev_obj.Si
net: switchdev: propagate extack to port attributesWhen a struct switchdev_attr is notified through switchdev, there is noway to report informational messages, unlike for struct switchdev_obj.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Ido Schimmel <idosch@nvidia.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>Signed-off-by: David S. Miller <davem@davemloft.net>
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netdrivers/net/can/dev.c b552766c872f ("can: dev: prevent potential information leak in can_fill_info()") 3e77f70e7345 ("can: dev: mov
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netdrivers/net/can/dev.c b552766c872f ("can: dev: prevent potential information leak in can_fill_info()") 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") 0a042c6ec991 ("can: dev: move netlink related code into seperate file") Code move.drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down") 214baf22870c ("net/mlx5e: Support HTB offload") Adjacent code changesnet/switchdev/switchdev.c 20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP") ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers") bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes") Transaction parameter gets dropped otherwise keep the fix.Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: use obj-$(CONFIG_NET_SWITCHDEV) form in net/MakefileCONFIG_NET_SWITCHDEV is a bool option. Change the ifeq conditional tothe standard obj-$(CONFIG_NET_SWITCHDEV) form.Use obj-y i
net: switchdev: use obj-$(CONFIG_NET_SWITCHDEV) form in net/MakefileCONFIG_NET_SWITCHDEV is a bool option. Change the ifeq conditional tothe standard obj-$(CONFIG_NET_SWITCHDEV) form.Use obj-y in net/switchdev/Makefile because Kbuild visits this Makefileonly when CONFIG_NET_SWITCHDEV=y.Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>Link: https://lore.kernel.org/r/20210125231659.106201-3-masahiroy@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPPIt's not true that switchdev_port_obj_notify() only inspects the->handled field of "struct switchdev_notifier_port_obj_info" i
net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPPIt's not true that switchdev_port_obj_notify() only inspects the->handled field of "struct switchdev_notifier_port_obj_info" ifcall_switchdev_blocking_notifiers() returns 0 - there's a WARN_ON()triggering for a non-zero return combined with ->handled not beingtrue. But the real problem here is that -EOPNOTSUPP is not beingproperly handled.The wrapper functions switchdev_handle_port_obj_add() et al change areturn value of -EOPNOTSUPP to 0, and the treatment of ->handled inswitchdev_port_obj_notify() seems to be designed to change that backto -EOPNOTSUPP in case nobody actually acted on the notifier (i.e.,everybody returned -EOPNOTSUPP).Currently, as soon as some device down the stack passes the check_cb()check, ->handled gets set to true, which means thatswitchdev_port_obj_notify() cannot actually ever return -EOPNOTSUPP.This, for example, means that the detection of hardware offloadsupport in the MRP code is broken: switchdev_port_obj_add() used bybr_mrp_switchdev_send_ring_test() always returns 0, so since the MRPcode thinks the generation of MRP test frames has been offloaded, nosuch frames are actually put on the wire. Similarly,br_mrp_switchdev_set_ring_role() also always returns 0, causingmrp->ring_role_offloaded to be set to 1.To fix this, continue to set ->handled true if any callback returnssuccess or any error distinct from -EOPNOTSUPP. But if all thecallbacks return -EOPNOTSUPP, make sure that ->handled stays false, sothe logic in switchdev_port_obj_notify() can propagate thatinformation.Fixes: 9a9f26e8f7ea ("bridge: mrp: Connect MRP API with the switchdev API")Fixes: f30f0601eb93 ("switchdev: Add helpers to aid traversal through lower devices")Reviewed-by: Petr Machata <petrm@nvidia.com>Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>Link: https://lore.kernel.org/r/20210125124116.102928-1-rasmus.villemoes@prevas.dkSigned-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: remove the transaction structure from port attributesSince the introduction of the switchdev API, port attributes weretransmitted to drivers for offloading using a two-step transac
net: switchdev: remove the transaction structure from port attributesSince the introduction of the switchdev API, port attributes weretransmitted to drivers for offloading using a two-step transactionalmodel, with a prepare phase that was supposed to catch all errors, and acommit phase that was supposed to never fail.Some classes of failures can never be avoided, like hardware access, ormemory allocation. In the latter case, merely attempting to move thememory allocation to the preparation phase makes it impossible to avoidmemory leaks, since commit 91cf8eceffc1 ("switchdev: Remove unusedtransaction item queue") which has removed the unused mechanism ofpassing on the allocated memory between one phase and another.It is time we admit that separating the preparation from the commitphase is something that is best left for the driver to decide, and notsomething that should be baked into the API, especially since there areno switchdev callers that depend on this.This patch removes the struct switchdev_trans member from switchdev portattribute notifier structures, and converts drivers to not look at thismember.In part, this patch contains a revert of my previous commit 2e554a7a5d8a("net: dsa: propagate switchdev vlan_filtering prepare phase todrivers").For the most part, the conversion was trivial except for:- Rocker's world implementation based on Broadcom OF-DPA had an odd implementation of ofdpa_port_attr_bridge_flags_set. The conversion was done mechanically, by pasting the implementation twice, then only keeping the code that would get executed during prepare phase on top, then only keeping the code that gets executed during the commit phase on bottom, then simplifying the resulting code until this was obtained.- DSA's offloading of STP state, bridge flags, VLAN filtering and multicast router could be converted right away. But the ageing time could not, so a shim was introduced and this was left for a further commit.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Acked-by: Linus Walleij <linus.walleij@linaro.org>Acked-by: Jiri Pirko <jiri@nvidia.com>Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreekReviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RBReviewed-by: Ido Schimmel <idosch@nvidia.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: delete switchdev_port_obj_add_nowAfter the removal of the transactional model insideswitchdev_port_obj_add_now, it has no added value and we can just callswitchdev_port_obj_notify
net: switchdev: delete switchdev_port_obj_add_nowAfter the removal of the transactional model insideswitchdev_port_obj_add_now, it has no added value and we can just callswitchdev_port_obj_notify directly, bypassing this function. Let'sdelete it.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Reviewed-by: Ido Schimmel <idosch@nvidia.com>Acked-by: Linus Walleij <linus.walleij@linaro.org>Acked-by: Jiri Pirko <jiri@nvidia.com>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: switchdev: remove the transaction structure from port object notifiersSince the introduction of the switchdev API, port objects weretransmitted to drivers for offloading using a two-step tran
net: switchdev: remove the transaction structure from port object notifiersSince the introduction of the switchdev API, port objects weretransmitted to drivers for offloading using a two-step transactionalmodel, with a prepare phase that was supposed to catch all errors, and acommit phase that was supposed to never fail.Some classes of failures can never be avoided, like hardware access, ormemory allocation. In the latter case, merely attempting to move thememory allocation to the preparation phase makes it impossible to avoidmemory leaks, since commit 91cf8eceffc1 ("switchdev: Remove unusedtransaction item queue") which has removed the unused mechanism ofpassing on the allocated memory between one phase and another.It is time we admit that separating the preparation from the commitphase is something that is best left for the driver to decide, and notsomething that should be baked into the API, especially since there areno switchdev callers that depend on this.This patch removes the struct switchdev_trans member from switchdev portobject notifier structures, and converts drivers to not look at thismember.Where driver conversion is trivial (like in the case of the MarvellPrestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it isdone in this patch.Where driver conversion needs more attention (DSA, Mellanox Spectrum),the conversion is left for subsequent patches and here we only fake theprepare/commit phases at a lower level, just not in the switchdevnotifier itself.Where the code has a natural structure that is best left alone as apreparation and a commit phase (as in the case of the Ocelot switch),that structure is left in place, just made to not depend upon theswitchdev transactional model.Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>Acked-by: Linus Walleij <linus.walleij@linaro.org>Acked-by: Jiri Pirko <jiri@nvidia.com>Reviewed-by: Ido Schimmel <idosch@nvidia.com>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
123456