.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices. It is intended for driver developers.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data, which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(extra pointers stored in the device private struct), then it is up
to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ... */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */
    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    /* dev is the net_device that probe() registered */
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be **complete** prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration
process happen after ``rtnl_lock`` is released, therefore in those cases
free_netdev() will defer some of the processing until ``rtnl_lock`` is released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.
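
The ownership rules above can be hard to keep apart, so the following is
a minimal userspace sketch of who runs the destructor and who frees the
device on each path. It is a toy model, not kernel code: ``struct toy_netdev``,
the ``toy_*`` functions and the counters are hypothetical names invented for
this illustration, and the real core does considerably more work.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdlib.h>

  /* Counters so the outcome of each path can be inspected. */
  static int destructor_calls, core_frees, driver_frees;

  /* Hypothetical stand-in for struct net_device. */
  struct toy_netdev {
    bool needs_free_netdev;
    bool registered;
    void (*priv_destructor)(struct toy_netdev *dev);
  };

  static void toy_destructor(struct toy_netdev *dev)
  {
    (void)dev;
    destructor_calls++;
  }

  /* Models register_netdevice(): on failure the core runs the destructor
   * but leaves freeing to the caller, even if needs_free_netdev is set.
   */
  static int toy_register(struct toy_netdev *dev, bool fail)
  {
    if (fail) {
      if (dev->priv_destructor)
        dev->priv_destructor(dev);
      return -1;
    }
    dev->registered = true;
    return 0;
  }

  /* Models unregister_netdevice() once the last reference is gone. */
  static void toy_unregister(struct toy_netdev *dev)
  {
    assert(dev->registered);
    dev->registered = false;
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    if (dev->needs_free_netdev) {
      core_frees++;
      free(dev);
    }
  }

  /* Models create_link() followed, on success, by a later teardown. */
  static int toy_link_lifecycle(bool fail_register)
  {
    struct toy_netdev *dev = calloc(1, sizeof(*dev));

    if (!dev)
      return -1;
    dev->needs_free_netdev = true;
    dev->priv_destructor = toy_destructor;

    if (toy_register(dev, fail_register)) {
      driver_frees++;    /* the driver must free on failure */
      free(dev);
      return -1;
    }
    toy_unregister(dev); /* the core destroys and frees */
    return 0;
  }

In this model the destructor runs exactly once on either path; only the
party responsible for the final free differs.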

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system, ``.ndo_uninit``
runs during de-registration after the device is closed but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transmission Unit (MTU). Upper layer
protocols must not pass a socket buffer (skb) to a device to transmit
with more data than the MTU. The MTU does not include link layer header
overhead, so for example on Ethernet, if the standard MTU of 1500 bytes
is used, the actual skb will contain up to 1514 bytes because of the
Ethernet header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize packets
is preferred.
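
The frame size arithmetic above can be written down as a small userspace
sketch. The ``TOY_*`` constants and ``toy_*`` helpers are hypothetical names
used only for this illustration; the values 14 and 4 are the Ethernet header
and single VLAN tag sizes quoted in the text.

.. code-block:: c

  #include <stdbool.h>
  #include <stddef.h>

  #define TOY_ETH_HLEN  14  /* destination MAC + source MAC + EtherType */
  #define TOY_VLAN_HLEN 4   /* one VLAN tag */

  /* Largest untagged frame the stack may hand to the device for this MTU. */
  static size_t toy_max_tx_len(size_t mtu)
  {
    return mtu + TOY_ETH_HLEN;
  }

  /* Largest frame the device should accept on receive: allow one VLAN tag. */
  static size_t toy_max_rx_len(size_t mtu)
  {
    return mtu + TOY_ETH_HLEN + TOY_VLAN_HLEN;
  }

  /* Preferred policy for oversize frames: drop them. */
  static bool toy_rx_should_drop(size_t frame_len, size_t mtu)
  {
    return frame_len > toy_max_rx_len(mtu);
  }

For an MTU of 1500 the helpers give 1514 bytes on transmit and 1518 bytes
on receive, matching the figures in the text.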


struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() mutex.

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	Linux 5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_get_stats / ndo_get_stats64:
	Synchronization: RCU (can be called concurrently with the stats
	update path).
	Context: atomic (can't sleep under RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets dev->lltx this will be called without holding
	netif_tx_lock. dev->lltx is meant for software drivers only, since
	they often have no per-queue state.

	Context: Process with BHs disabled or BH (timer),
		 will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK: everything OK.
	* NETDEV_TX_BUSY: cannot transmit the packet, try later.
	  Usually a bug; it means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled
	Notes: Deprecated in favor of ndo_set_rx_mode_async which runs
	in process context.

ndo_set_rx_mode_async:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process (from a work queue)
	Notes: Async version of ndo_set_rx_mode which runs in process
	context. Receives snapshots of the unicast and multicast address lists.

ndo_change_rx_flags:
	Synchronization: rtnl_lock() mutex. In addition, netdev instance
	lock if the driver implements queue management or shaper API.

ndo_setup_tc:
	Locking depends on ``tc_setup_type``. For most types the callback
	is invoked under ``rtnl_lock`` and netdev instance lock if the driver
	implements queue management or shaper API.

	For ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` ``rtnl_lock`` may or
	may not be held, and the netdev instance lock is not held.
	``TC_SETUP_BLOCK`` runs under ``block->cb_lock`` and ``TC_SETUP_FT``
	runs under ``flowtable->flow_block_lock``.

Most ndo callbacks not specified in the list above run
under ``rtnl_lock``. In addition, the netdev instance lock is taken as well if
the driver implements queue management or shaper API.
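
The ndo_start_xmit return codes listed above imply a flow control contract:
the driver stops its queue before the ring overflows, so NETDEV_TX_BUSY
should never need to be returned. The following is a minimal userspace
sketch of that contract, assuming a single queue with a fixed-size ring.
All ``toy_*`` / ``TOY_*`` names are hypothetical; in a real driver the
stop and wake steps would typically be netif_tx_stop_queue() and
netif_tx_wake_queue().

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  #define TOY_RING_SIZE 4   /* hypothetical descriptor count */

  enum toy_tx_ret { TOY_TX_OK, TOY_TX_BUSY };

  struct toy_txq {
    unsigned int in_flight; /* descriptors owned by the "hardware" */
    bool stopped;           /* models the stopped state of the queue */
  };

  /* Models ndo_start_xmit; the stack only calls it while !stopped. */
  static enum toy_tx_ret toy_start_xmit(struct toy_txq *q)
  {
    if (q->in_flight == TOY_RING_SIZE) {
      /* Unreachable when flow control works: report busy and
       * do NOT place the packet in the ring.
       */
      q->stopped = true;
      return TOY_TX_BUSY;
    }

    q->in_flight++;         /* post the packet to the ring */

    /* Stop the queue before it overflows, not after. */
    if (q->in_flight == TOY_RING_SIZE)
      q->stopped = true;
    return TOY_TX_OK;
  }

  /* Models TX completion: reclaim one descriptor and wake the queue. */
  static void toy_tx_complete(struct toy_txq *q)
  {
    assert(q->in_flight > 0);
    q->in_flight--;
    q->stopped = false;
  }

  /* Models the stack: transmit only while the queue is running.
   * Returns how many of the n packets were accepted.
   */
  static unsigned int toy_stack_send(struct toy_txq *q, unsigned int n)
  {
    unsigned int sent = 0;

    while (n-- && !q->stopped) {
      enum toy_tx_ret ret = toy_start_xmit(q);

      assert(ret == TOY_TX_OK); /* BUSY would mean a driver bug */
      sent++;
    }
    return sent;
  }

Because the queue is stopped as soon as the last descriptor is used, the
model never reaches the TOY_TX_BUSY branch, however many packets are offered.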

struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state. The device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances, which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		 softirq;
		 will be called with interrupts disabled by netconsole.

netdev instance lock
====================

Historically, all networking control operations were protected by a single
global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
global lock with separate locks for each network namespace. Additionally,
properties of individual netdevs are increasingly protected by per-netdev locks.

For device drivers that implement shaping or queue management APIs, all control
operations will be performed under the netdev instance lock.
Drivers can also explicitly request the instance lock to be held during ops
by setting ``request_ops_lock`` to true. Code comments and docs refer
to drivers which have ops called under the instance lock as "ops locked".
See also the documentation of the ``lock`` member of struct net_device.

There is also a case of taking two per-netdev locks in sequence when netdev
queues are leased, that is, the netdev-scope lock is taken for both the
virtual and the physical device. To prevent deadlocks, the virtual device's
lock must always be acquired before the physical device's (see
``netdev_nl_queue_create_doit``).

Device drivers are encouraged to rely on the instance lock where possible.

For the (mostly software) drivers that need to interact with the core stack,
there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
functions handle acquiring the instance lock themselves, while the
``netif_xxx`` functions assume that the driver has already acquired
the instance lock.
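
The split between the two interface families can be illustrated with a
minimal userspace sketch. It only models the locking convention described
above; ``struct toy_netdev`` and every ``toy_*`` function are hypothetical,
and the real ``dev_set_mtu`` and ``netif_set_mtu`` do considerably more.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical stand-in for struct net_device and its instance lock. */
  struct toy_netdev {
    bool lock_held;         /* models the per-netdev instance lock */
    unsigned int mtu;
  };

  static void toy_netdev_lock(struct toy_netdev *dev)
  {
    assert(!dev->lock_held); /* the lock is not recursive */
    dev->lock_held = true;
  }

  static void toy_netdev_unlock(struct toy_netdev *dev)
  {
    assert(dev->lock_held);
    dev->lock_held = false;
  }

  /* "netif_xxx" flavour: the caller must already hold the instance lock. */
  static int toy_netif_set_mtu(struct toy_netdev *dev, unsigned int mtu)
  {
    assert(dev->lock_held);
    dev->mtu = mtu;
    return 0;
  }

  /* "dev_xxx" flavour: takes the lock, then calls the netif_ variant. */
  static int toy_dev_set_mtu(struct toy_netdev *dev, unsigned int mtu)
  {
    int err;

    toy_netdev_lock(dev);
    err = toy_netif_set_mtu(dev, mtu);
    toy_netdev_unlock(dev);
    return err;
  }

Code that already runs under the instance lock would call the ``netif_``
style helper directly; calling the ``dev_`` style wrapper there would try to
take the lock a second time, which the model flags with an assertion.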

struct net_device_ops
---------------------

``ndos`` are called without holding the instance lock for most drivers.

"Ops locked" drivers will have most of the ``ndos`` invoked under
the instance lock.

struct ethtool_ops
------------------

For non-"ops locked" drivers ``ethtool_ops`` are executed under ``rtnl_lock``.

For "ops locked" drivers, ``ethtool_ops``, unlike ``ndos``, run under
the instance lock **only**. Drivers may request that ``rtnl_lock``
is held around specific operations (both SET and GET) by setting
appropriate bits in ``ethtool_ops::op_needs_rtnl`` (if the necessary
``ETHTOOL_OP_NEEDS_RTNL_*`` bit doesn't exist, just add it).
Commonly used core helpers which force drivers to selectively opt in to
``rtnl_lock`` protection include ``netdev_update_features()``,
``netif_set_real_num_tx_queues()``, and phylink helpers.

struct netdev_stat_ops
----------------------

"qstat" ops are invoked under the instance lock for "ops locked" drivers,
and under ``rtnl_lock`` for all other drivers.

struct net_shaper_ops
---------------------

All net shaper callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting net shapers automatically enables "ops locking".

struct netdev_queue_mgmt_ops
----------------------------

All queue management callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting struct netdev_queue_mgmt_ops automatically enables
"ops locking".

Notifiers and netdev instance lock
----------------------------------

For device drivers that implement shaping or queue management APIs,
some of the notifiers (``enum netdev_cmd``) run under the netdev
instance lock.

The following netdev notifiers always run under the instance lock:

* ``NETDEV_XDP_FEAT_CHANGE``

For devices with locked ops, currently only the following notifiers
run under the lock:

* ``NETDEV_CHANGE``
* ``NETDEV_CHANGENAME``
* ``NETDEV_REGISTER``
* ``NETDEV_UP``

The following notifiers run without the lock:

* ``NETDEV_UNREGISTER``

There are no clear expectations for the remaining notifiers. Notifiers not on
the list may run with or without the instance lock, potentially even invoking
the same notifier type with and without the lock from different code paths.
The goal is to eventually ensure that all (or most, with a few documented
exceptions) notifiers run under the instance lock. Please extend this
documentation whenever you make an explicit assumption about the lock being
held from a notifier.

NETDEV_INTERNAL symbol namespace
================================

Symbols exported as NETDEV_INTERNAL can only be used in networking
core and drivers which exclusively flow via the main networking list and trees.
Note that the inverse is not true; most symbols outside of NETDEV_INTERNAL
are not expected to be used by random code outside netdev either.
Symbols may lack the designation because they predate the namespaces,
or simply due to an oversight.