.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices. It is intended for driver developers.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(extra pointers stored in the device private struct) then it is up
to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of
struct net_device in contexts where ``rtnl_lock`` is not held (e.g. driver
probe and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */
    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    /* dev is the net_device allocated in probe() */
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be **complete** prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(). It will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback. The driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().
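
The ownership rules above can be condensed into a small userspace model.
This is only a sketch: ``model_netdev``, ``model_register()``,
``model_unregister()``, ``model_free()`` and ``model_create()`` are
hypothetical stand-ins for struct net_device, register_netdevice(),
unregister_netdevice(), free_netdev() and the ``create_link()`` example.
Reference counting and the processing deferred until ``rtnl_lock`` is
released are deliberately left out.

.. code-block:: c

  #include <stdbool.h>
  #include <stdlib.h>

  /* Hypothetical userspace stand-in for struct net_device. */
  struct model_netdev {
    bool needs_free_netdev;
    void (*priv_destructor)(struct model_netdev *dev);
  };

  /* Counters so that a caller can observe who destroyed and freed what. */
  static int destructor_calls;
  static int free_calls;

  static void model_destructor(struct model_netdev *dev)
  {
    (void)dev;
    destructor_calls++;
  }

  /* Models free_netdev(). */
  static void model_free(struct model_netdev *dev)
  {
    free_calls++;
    free(dev);
  }

  /* Models register_netdevice(); "fail" forces the error path. */
  static int model_register(struct model_netdev *dev, bool fail)
  {
    if (!fail)
      return 0;
    /* On failure the core runs the destructor but does not free. */
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    return -1;
  }

  /* Models unregister_netdevice() once all references are gone. */
  static void model_unregister(struct model_netdev *dev)
  {
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
      model_free(dev);
  }

  /* Models create_link(); returns the device, or NULL on failure. */
  static struct model_netdev *model_create(bool fail)
  {
    struct model_netdev *dev = calloc(1, sizeof(*dev));

    if (!dev)
      return NULL;
    dev->needs_free_netdev = true;
    dev->priv_destructor = model_destructor;

    if (model_register(dev, fail)) {
      /* needs_free_netdev has no effect yet: free by hand. */
      model_free(dev);
      return NULL;
    }
    return dev;
  }

In this model a failed registration runs the destructor once and leaves the
final free to the caller, while a successful registration followed by
``model_unregister()`` runs both the destructor and the free without any
further action from the caller.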

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration
process happen after ``rtnl_lock`` is released, therefore in those cases
free_netdev() will defer some of the processing until ``rtnl_lock`` is
released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system. ``.ndo_uninit``
runs during de-registration, after the device is closed but while other
subsystems may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transmission Unit (MTU). The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. Because link layer headers are excluded, on Ethernet with
the standard MTU of 1500 bytes the actual skb will contain up to
1514 bytes because of the Ethernet header. Devices should allow for the
4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize
packets is preferred.


struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	Linux 5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	New drivers should not implement it.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_get_stats / ndo_get_stats64:
	Synchronization: RCU (can be called concurrently with the stats
	update path).
	Context: atomic (can't sleep under RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets dev->lltx this will be called without holding
	netif_tx_lock. dev->lltx is meant for software drivers only, since
	they often have no per-queue state.

	Context: Process with BHs disabled or BH (timer),
	will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK: everything ok.
	* NETDEV_TX_BUSY: cannot transmit the packet, try later.
	  Usually a bug, it means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled
	Notes: Deprecated in favor of ndo_set_rx_mode_async which runs
	in process context.

ndo_set_rx_mode_async:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process (from a work queue)
	Notes: Async version of ndo_set_rx_mode which runs in process
	context. Receives snapshots of the unicast and multicast address lists.

ndo_change_rx_flags:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.

ndo_setup_tc:
	Locking depends on ``tc_setup_type``. For most types the callback
	is invoked under ``rtnl_lock`` and netdev instance lock if the driver
	implements queue management or shaper API.

	For ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` ``rtnl_lock`` may or
	may not be held, and the netdev instance lock is not held.
	``TC_SETUP_BLOCK`` runs under ``block->cb_lock`` and ``TC_SETUP_FT``
	runs under ``flowtable->flow_block_lock``.

Most ndo callbacks not specified in the list above run
under ``rtnl_lock``. In addition, the netdev instance lock is taken as well if
the driver implements queue management or shaper API.

struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state. Device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		softirq
		will be called with interrupts disabled by netconsole.

netdev instance lock
====================

Historically, all networking control operations were protected by a single
global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
global lock with separate locks for each network namespace. Additionally,
properties of individual netdevs are increasingly protected by per-netdev
locks.

For device drivers that implement shaping or queue management APIs, all control
operations will be performed under the netdev instance lock.
Drivers can also explicitly request the instance lock to be held during ops
by setting ``request_ops_lock`` to true. Code comments and docs refer
to drivers which have ops called under the instance lock as "ops locked".
See also the documentation of the ``lock`` member of struct net_device.

There is also a case of taking two per-netdev locks in sequence when netdev
queues are leased, that is, the netdev-scope lock is taken for both the
virtual and the physical device. To prevent deadlocks, the virtual device's
lock must always be acquired before the physical device's (see
``netdev_nl_queue_create_doit``).

Device drivers are encouraged to rely on the instance lock where possible.

For the (mostly software) drivers that need to interact with the core stack,
there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
functions handle acquiring the instance lock themselves, while the
``netif_xxx`` functions assume that the driver has already acquired
the instance lock.

struct net_device_ops
---------------------

``ndos`` are called without holding the instance lock for most drivers.

"Ops locked" drivers will have most of the ``ndos`` invoked under
the instance lock.

struct ethtool_ops
------------------

For non-"ops locked" drivers ethtool_ops are executed under ``rtnl_lock``.

For "ops locked" drivers, ``ethtool_ops``, unlike ``ndos``, run under
the instance lock **only**. Drivers may request that ``rtnl_lock``
is held around specific operations (both SET and GET) by setting
appropriate bits in ``ethtool_ops::op_needs_rtnl`` (if the necessary
``ETHTOOL_OP_NEEDS_RTNL_*`` bit doesn't exist, just add it).
Commonly used core helpers which force drivers to selectively opt-in to
``rtnl_lock`` protection include ``netdev_update_features()``,
``netif_set_real_num_tx_queues()``, and phylink helpers.

struct netdev_stat_ops
----------------------

"qstat" ops are invoked under the instance lock for "ops locked" drivers,
and under ``rtnl_lock`` for all other drivers.

struct net_shaper_ops
---------------------

All net shaper callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting net shapers automatically enables "ops locking".
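
The ``dev_xxx`` versus ``netif_xxx`` convention described earlier in this
section can be illustrated with a small userspace model. This is a sketch
only: ``lock_model_dev``, ``model_lock()``, ``model_unlock()``,
``model_netif_set_mtu()`` and ``model_dev_set_mtu()`` are hypothetical names,
and a plain flag stands in for the real instance lock.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical stand-in for struct net_device and its instance lock. */
  struct lock_model_dev {
    bool lock_held;
    unsigned int mtu;
  };

  static void model_lock(struct lock_model_dev *dev)
  {
    /* Taking the lock twice is a bug in this model. */
    assert(!dev->lock_held);
    dev->lock_held = true;
  }

  static void model_unlock(struct lock_model_dev *dev)
  {
    assert(dev->lock_held);
    dev->lock_held = false;
  }

  /* netif_xxx style: the caller must already hold the instance lock. */
  static int model_netif_set_mtu(struct lock_model_dev *dev, unsigned int mtu)
  {
    assert(dev->lock_held);
    dev->mtu = mtu;
    return 0;
  }

  /* dev_xxx style: takes and releases the instance lock itself. */
  static int model_dev_set_mtu(struct lock_model_dev *dev, unsigned int mtu)
  {
    int err;

    model_lock(dev);
    err = model_netif_set_mtu(dev, mtu);
    model_unlock(dev);
    return err;
  }

In this model, code that already runs under the instance lock calls the
``netif_xxx`` style helper directly. Calling the ``dev_xxx`` style helper
from such a context would trip the assertion in ``model_lock()``.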

struct netdev_queue_mgmt_ops
----------------------------

All queue management callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting struct netdev_queue_mgmt_ops automatically enables
"ops locking".

Notifiers and netdev instance lock
----------------------------------

For device drivers that implement shaping or queue management APIs,
some of the notifiers (``enum netdev_cmd``) run under the netdev
instance lock.

The following netdev notifiers are always run under the instance lock:

* ``NETDEV_XDP_FEAT_CHANGE``

For devices with locked ops, currently only the following notifiers
run under the lock:

* ``NETDEV_CHANGE``
* ``NETDEV_CHANGENAME``
* ``NETDEV_REGISTER``
* ``NETDEV_UP``

The following notifiers run without the lock:

* ``NETDEV_UNREGISTER``

There are no clear expectations for the remaining notifiers. Notifiers not on
the list may run with or without the instance lock, potentially even invoking
the same notifier type with and without the lock from different code paths.
The goal is to eventually ensure that all (or most, with a few documented
exceptions) notifiers run under the instance lock. Please extend this
documentation whenever you make an explicit assumption about the lock being
held from a notifier.

NETDEV_INTERNAL symbol namespace
================================

Symbols exported as NETDEV_INTERNAL can only be used in networking
core and drivers which exclusively flow via the main networking list and trees.
Note that the inverse is not true: most symbols outside of NETDEV_INTERNAL
are not expected to be used by random code outside netdev either.
Symbols may lack the designation because they predate the namespaces,
or simply due to an oversight.