.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices. It is intended for driver developers.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded and
must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct net_device
in contexts where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    /* dev is the net_device that probe() allocated and registered */
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system; ``.ndo_uninit``
runs during de-registration after the device is closed, but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transmission Unit (MTU). The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet, if the standard MTU of 1500 bytes
is used, the actual skb will contain up to 1514 bytes because of the
Ethernet header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule.  The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag).  The device may
drop, truncate, or pass up oversize packets, but dropping oversize
packets is preferred.
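
The arithmetic above can be sketched in plain C. This is a minimal,
self-contained illustration rather than kernel code: the constants and helper
names (``EX_ETH_HLEN``, ``EX_VLAN_HLEN``, ``ex_max_tx_len()``,
``ex_min_rx_buf_len()``) are made up for this example and simply mirror the
14 byte Ethernet header and 4 byte VLAN tag mentioned in the text.

.. code-block:: c

  #include <stddef.h>

  /* Hypothetical constants for this sketch, taken from the figures above. */
  #define EX_ETH_HLEN  14  /* Ethernet header: 6 dst + 6 src + 2 ethertype */
  #define EX_VLAN_HLEN  4  /* one VLAN tag */

  /* Largest untagged frame the stack may hand to the device for this MTU. */
  static size_t ex_max_tx_len(size_t mtu)
  {
    return mtu + EX_ETH_HLEN;
  }

  /* Smallest receive buffer that still fits an MTU-sized, VLAN-tagged frame. */
  static size_t ex_min_rx_buf_len(size_t mtu)
  {
    return mtu + EX_ETH_HLEN + EX_VLAN_HLEN;
  }

With an MTU of 1500 the two helpers yield the 1514 and 1518 byte figures
quoted above.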


struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	linux-5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers, so don't use.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
	lock if the driver implements queue management or shaper API.
	Context: process

ndo_get_stats:
	Synchronization: RCU (can be called concurrently with the stats
	update path).
	Context: atomic (can't sleep under RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets dev->lltx this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed.
	The locking there should also properly protect against
	set_rx_mode. WARNING: use of dev->lltx is deprecated.
	Don't use it for new drivers.

	Context: Process with BHs disabled or BH (timer),
		 will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK everything ok.
	* NETDEV_TX_BUSY Cannot transmit packet, try later
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled

ndo_setup_tc:
	``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks
	(i.e. no ``rtnl_lock`` and no device instance lock). The rest of
	``tc_setup_type`` types run under netdev instance lock if the driver
	implements queue management or shaper API.

Most ndo callbacks not specified in the list above are running
under ``rtnl_lock``. In addition, netdev instance lock is taken as well if
the driver implements queue management or shaper API.
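
The ``NETDEV_TX_BUSY`` note for ndo_start_xmit comes down to queue flow
control: a driver is expected to stop its queue while the ring can still take
the packet in flight, so that the transmit routine is never entered with a full
ring. The following is a minimal userspace model of that idea, not driver code.
All names (``struct ex_ring``, ``ex_xmit()``, ``ex_tx_complete()``,
``EX_TX_OK``, ``EX_TX_BUSY``) are hypothetical, and each packet is assumed to
need exactly one descriptor.

.. code-block:: c

  #include <stdbool.h>

  enum ex_tx_ret { EX_TX_OK, EX_TX_BUSY };

  struct ex_ring {
    unsigned int size;  /* total descriptors */
    unsigned int used;  /* descriptors currently owned by the "hardware" */
    bool stopped;       /* models a stopped TX queue */
  };

  /* Models the transmit routine: queue one packet using one descriptor. */
  static enum ex_tx_ret ex_xmit(struct ex_ring *r)
  {
    if (r->used == r->size) {
      /* Only reachable if the queue was not stopped in time.
       * The packet is NOT placed in the ring when reporting busy.
       */
      r->stopped = true;
      return EX_TX_BUSY;
    }

    r->used++;

    /* Stop the queue as soon as the next packet could not fit. */
    if (r->used == r->size)
      r->stopped = true;

    return EX_TX_OK;
  }

  /* Models TX completion: reclaim one descriptor and restart the queue. */
  static void ex_tx_complete(struct ex_ring *r)
  {
    if (r->used > 0)
      r->used--;
    if (r->stopped && r->used < r->size)
      r->stopped = false;
  }

In this model the caller is assumed not to invoke ``ex_xmit()`` while
``stopped`` is set, which is why reaching the busy branch indicates broken
flow control rather than a normal condition.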

struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state.  Device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		softirq;
		will be called with interrupts disabled by netconsole.

netdev instance lock
====================

Historically, all networking control operations were protected by a single
global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
global lock with separate locks for each network namespace. Additionally,
properties of individual netdevs are increasingly protected by per-netdev locks.

For device drivers that implement shaping or queue management APIs, all control
operations will be performed under the netdev instance lock.
Drivers can also explicitly request the instance lock to be held during ops
by setting ``request_ops_lock`` to true. Code comments and docs refer
to drivers which have ops called under the instance lock as "ops locked".
See also the documentation of the ``lock`` member of struct net_device.

In the future, there will be an option for individual
drivers to opt out of using ``rtnl_lock`` and instead perform their control
operations directly under the netdev instance lock.

Device drivers are encouraged to rely on the instance lock where possible.

For the (mostly software) drivers that need to interact with the core stack,
there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
functions handle acquiring the instance lock themselves, while the
``netif_xxx`` functions assume that the driver has already acquired
the instance lock.
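
The split between the two interface families can be pictured with a small
userspace model. This is only a sketch of the convention described above, not
the kernel implementation: ``struct ex_dev``, ``ex_lock()``, ``ex_unlock()``,
``exif_set_mtu()`` and ``ex_dev_set_mtu()`` are hypothetical stand-ins for a
netdev, its instance lock, a ``netif_xxx`` function and a ``dev_xxx`` function
respectively, and the lock is modelled as a plain flag.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  struct ex_dev {
    bool locked;       /* models the instance lock being held */
    unsigned int mtu;
  };

  static void ex_lock(struct ex_dev *dev)
  {
    assert(!dev->locked);
    dev->locked = true;
  }

  static void ex_unlock(struct ex_dev *dev)
  {
    assert(dev->locked);
    dev->locked = false;
  }

  /* "netif_xxx" flavour: the caller must already hold the lock. */
  static int exif_set_mtu(struct ex_dev *dev, unsigned int new_mtu)
  {
    assert(dev->locked);
    dev->mtu = new_mtu;
    return 0;
  }

  /* "dev_xxx" flavour: takes the lock, then calls the locked variant. */
  static int ex_dev_set_mtu(struct ex_dev *dev, unsigned int new_mtu)
  {
    int err;

    ex_lock(dev);
    err = exif_set_mtu(dev, new_mtu);
    ex_unlock(dev);
    return err;
  }

A caller that already holds the lock uses the ``exif_`` variant directly;
calling the ``ex_dev_`` variant from such a context would trip the assertion
in ``ex_lock()``, which in this model stands in for a self-deadlock.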

struct net_device_ops
---------------------

``ndos`` are called without holding the instance lock for most drivers.

"Ops locked" drivers will have most of the ``ndos`` invoked under
the instance lock.

struct ethtool_ops
------------------

Similarly to ``ndos`` the instance lock is only held for select drivers.
For "ops locked" drivers all ethtool ops without exceptions should
be called under the instance lock.

struct netdev_stat_ops
----------------------

"qstat" ops are invoked under the instance lock for "ops locked" drivers,
and under rtnl_lock for all other drivers.

struct net_shaper_ops
---------------------

All net shaper callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting net shapers automatically enables "ops locking".

struct netdev_queue_mgmt_ops
----------------------------

All queue management callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting struct netdev_queue_mgmt_ops automatically enables
"ops locking".

Notifiers and netdev instance lock
----------------------------------

For device drivers that implement shaping or queue management APIs,
some of the notifiers (``enum netdev_cmd``) are running under the netdev
instance lock.

The following netdev notifiers are always run under the instance lock:

* ``NETDEV_XDP_FEAT_CHANGE``

For devices with locked ops, currently only the following notifiers are
running under the lock:

* ``NETDEV_CHANGE``
* ``NETDEV_REGISTER``
* ``NETDEV_UP``

The following notifiers are running without the lock:

* ``NETDEV_UNREGISTER``

There are no clear expectations for the remaining notifiers. Notifiers not on
the list may run with or without the instance lock, potentially even invoking
the same notifier type with and without the lock from different code paths.
The goal is to eventually ensure that all (or most, with a few documented
exceptions) notifiers run under the instance lock. Please extend this
documentation whenever you make an explicit assumption about the lock being
held from a notifier.

NETDEV_INTERNAL symbol namespace
================================

Symbols exported as NETDEV_INTERNAL can only be used in networking
core and drivers which exclusively flow via the main networking list and trees.
Note that the inverse is not true; most symbols outside of NETDEV_INTERNAL
are not expected to be used by random code outside netdev either.
Symbols may lack the designation because they predate the namespaces,
or simply due to an oversight.

419