.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically, Device-to-Device data transfers over the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strain on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


RX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

	# enable header split
	ethtool -G eth1 tcp-data-split on


	# enable flow steering
	ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example). ``equal 15`` spreads hashed flows evenly across the first 15
queues (0 through 14), so queue 15 receives only explicitly steered flows::

	ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

	/* Bind dmabuf to NIC RX queue 15 */
	struct netdev_queue *queues;
	int n_queue_index = 1;	/* number of entries in queues */

	queues = malloc(sizeof(*queues) * n_queue_index);

	queues[0]._present.type = 1;
	queues[0]._present.idx = 1;
	queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
	queues[0].idx = 15;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
	netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

	rsp = netdev_bind_rx(*ys, req);

	dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique ID that refers to the bound
dmabuf.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf-bound RX queue::

	ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals this to the user via the SCM_DEVMEM_* cmsgs::

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;

		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			/* Frag landed in dmabuf.
			 *
			 * dmabuf_cmsg->dmabuf_id is the dmabuf the
			 * frag landed on.
			 *
			 * dmabuf_cmsg->frag_offset is the offset into
			 * the dmabuf where the frag starts.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 *
			 * dmabuf_cmsg->frag_token is a token used to
			 * refer to this frag for later freeing.
			 */

			struct dmabuf_token token;
			token.token_start = dmabuf_cmsg->frag_token;
			token.token_count = 1;
			continue;
		}

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
			/* Frag landed in linear buffer.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 */
			continue;
	}

Applications may receive two kinds of cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.
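
As a rough illustration of the cmsg handling described above, the sketch below
tallies how many payload bytes landed in the dmabuf and how many fell back to
the linear buffer. It is not taken from ncdevmem: ``struct frag_info``,
``struct devmem_stats`` and ``devmem_account()`` are hypothetical names,
``struct frag_info`` is assumed to mirror the layout of ``struct dmabuf_cmsg``,
and the fallback values for the ``SCM_DEVMEM_*`` constants are assumptions that
should give way to the kernel uapi definitions wherever those are available::

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>

	/* Assumed fallback values; use the kernel uapi headers when present. */
	#ifndef SCM_DEVMEM_LINEAR
	#define SCM_DEVMEM_LINEAR 78
	#endif
	#ifndef SCM_DEVMEM_DMABUF
	#define SCM_DEVMEM_DMABUF 79
	#endif

	/* Hypothetical mirror of struct dmabuf_cmsg (layout is an assumption). */
	struct frag_info {
		uint64_t frag_offset;
		uint32_t frag_size;
		uint32_t frag_token;
		uint32_t dmabuf_id;
		uint32_t flags;
	};

	struct devmem_stats {
		size_t dmabuf_bytes;	/* payload delivered into the dmabuf */
		size_t linear_bytes;	/* payload that landed in host memory */
	};

	/* Add the frags described by the devmem cmsgs of msg to st. */
	static void devmem_account(struct msghdr *msg, struct devmem_stats *st)
	{
		struct cmsghdr *cm;
		struct frag_info f;

		assert(msg && st);

		for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
			if (cm->cmsg_level != SOL_SOCKET ||
			    cm->cmsg_len < CMSG_LEN(sizeof(f)))
				continue;

			/* Copy out: CMSG_DATA() may not be suitably aligned. */
			memcpy(&f, CMSG_DATA(cm), sizeof(f));

			if (cm->cmsg_type == SCM_DEVMEM_DMABUF)
				st->dmabuf_bytes += f.frag_size;
			else if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
				st->linear_bytes += f.frag_size;
		}
	}

A caller would invoke ``devmem_account(&msg, &st)`` after each successful
``recvmsg(fd, &msg, MSG_SOCK_DEVMEM)``. A growing ``linear_bytes`` count
suggests that header split is not taking effect for the flow.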


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
			 sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf memory that is bound to the RX
queue and will lead to packet drops.

The user must pass no more than 128 tokens per call, and the token_count
fields of those tokens must add up to no more than 1024 frags. If the user
provides more than 1024 frags, the kernel will free up to 1024 frags and
return early.

The kernel returns the number of frags actually freed. This number can be
less than the number of frags the user asked to free in case of:

(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
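
The limits above suggest batching frees. The following is a minimal sketch,
not code from the kernel tree, that packs a list of frag tokens into at most
128 ranges covering at most 1024 frags. ``struct token_range`` and
``pack_tokens()`` are hypothetical names; ``struct token_range`` is assumed
to mirror ``struct dmabuf_token``, and the sketch further assumes that a
token with ``token_count = n`` refers to ``n`` consecutive token values
starting at ``token_start``, as the field names suggest::

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>

	#define MAX_TOKENS	128	/* per-call token limit from the text */
	#define MAX_FRAGS	1024	/* per-call frag limit from the text */

	/* Hypothetical mirror of struct dmabuf_token (assumed layout). */
	struct token_range {
		uint32_t token_start;
		uint32_t token_count;
	};

	/*
	 * Pack frag tokens into ranges for one SO_DEVMEM_DONTNEED call.
	 * out must have room for MAX_TOKENS entries. Returns how many
	 * entries of frag_tokens were consumed; *n_out is set to the
	 * number of ranges written.
	 */
	static size_t pack_tokens(const uint32_t *frag_tokens, size_t n,
				  struct token_range *out, size_t *n_out)
	{
		size_t used = 0, ranges = 0;

		assert(frag_tokens && out && n_out);

		while (used < n && used < MAX_FRAGS) {
			uint32_t t = frag_tokens[used];
			struct token_range *last = ranges ? &out[ranges - 1] : NULL;

			if (last && last->token_start + last->token_count == t) {
				/* Extends the previous range by one frag. */
				last->token_count++;
			} else {
				if (ranges == MAX_TOKENS)
					break;
				out[ranges].token_start = t;
				out[ranges].token_count = 1;
				ranges++;
			}
			used++;
		}

		*n_out = ranges;
		return used;
	}

The resulting array would then be handed to ``setsockopt()`` with
``SO_DEVMEM_DONTNEED`` and an option length of ``n_out * sizeof(*out)``,
repeating the pack-and-free step until every token has been consumed.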


TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

	struct netdev_bind_tx_req *req = NULL;
	struct netdev_bind_tx_rsp *rsp = NULL;
	struct ynl_error yerr;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_tx_req_alloc();
	netdev_bind_tx_req_set_ifindex(req, ifindex);
	netdev_bind_tx_req_set_fd(req, dmabuf_fd);

	rsp = netdev_bind_tx(*ys, req);

	tx_dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to the bound
dmabuf.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The user application must use the MSG_ZEROCOPY flag when sending devmem TCP
data. Devmem cannot be copied by the kernel, so the semantics of devmem TX are
similar to the semantics of MSG_ZEROCOPY. As with MSG_ZEROCOPY, the socket
must first opt in via SO_ZEROCOPY (``opt`` is an int set to 1)::

	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user binds the TX socket to the same interface
the dma-buf has been bound to via SO_BINDTODEVICE::

	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

	char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
	struct dmabuf_tx_cmsg ddmabuf;
	struct msghdr msg = {};
	struct cmsghdr *cmsg;
	struct iovec iov[2];

	iov[0].iov_base = (void*)100;
	iov[0].iov_len = 1024;
	iov[1].iov_base = (void*)2000;
	iov[1].iov_len = 2048;

	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	msg.msg_control = ctrl_data;
	msg.msg_controllen = sizeof(ctrl_data);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));

	ddmabuf.dmabuf_id = tx_dmabuf_id;

	*((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;

	sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
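
Because iov_base carries an offset rather than a pointer, it is easy to build
an iovec that points past the end of the dmabuf. The sketch below fills the
iovec array from (offset, length) pairs and rejects ranges that do not fit.
``struct tx_range`` and ``fill_dmabuf_iov()`` are hypothetical helpers, and
the bounds check is only a userspace sanity check under the assumption that
the caller knows the dmabuf size::

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <sys/uio.h>

	/* Hypothetical description of one region of the dmabuf to send. */
	struct tx_range {
		size_t offset;	/* byte offset into the dmabuf */
		size_t len;	/* number of bytes to send */
	};

	/*
	 * Fill iov[0..n) from r[0..n). Returns 0 on success, or -1 if any
	 * range extends past dmabuf_size (iov may be partially filled).
	 */
	static int fill_dmabuf_iov(struct iovec *iov, const struct tx_range *r,
				   size_t n, size_t dmabuf_size)
	{
		size_t i;

		assert(iov && r);

		for (i = 0; i < n; i++) {
			if (r[i].offset > dmabuf_size ||
			    r[i].len > dmabuf_size - r[i].offset)
				return -1;

			/* The offset is encoded in iov_base, not a pointer. */
			iov[i].iov_base = (void *)(uintptr_t)r[i].offset;
			iov[i].iov_len = r[i].len;
		}

		return 0;
	}

For the transfer described above, the ranges would be ``{100, 1024}`` and
``{2000, 2048}``, and the filled array is assigned to ``msg.msg_iov``.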


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

	int64_t tstop = gettimeofday_ms() + waittime_ms;
	char control[CMSG_SPACE(100)] = {};
	struct sock_extended_err *serr;
	struct msghdr msg = {};
	struct cmsghdr *cm;
	int retries = 10;
	__u32 hi, lo;

	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	while (gettimeofday_ms() < tstop) {
		if (!do_poll(fd))
			continue;

		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (void *)CMSG_DATA(cm);

			hi = serr->ee_data;
			lo = serr->ee_info;

			fprintf(stdout, "tx complete [%u,%u]\n", lo, hi);
		}
	}

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.
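
To decide when a region of the dmabuf may be reused, the completion range has
to be matched against the sends that were issued. The sketch below decodes one
notification, assuming the usual MSG_ZEROCOPY conventions apply to devmem TX:
``ee_errno`` is 0, ``ee_origin`` is ``SO_EE_ORIGIN_ZEROCOPY``, and
``ee_info`` through ``ee_data`` is an inclusive range of per-socket send
counters. ``zc_completed()`` is a hypothetical helper name::

	#include <assert.h>
	#include <stdint.h>
	#include <linux/errqueue.h>

	/*
	 * Decode a zerocopy completion. Returns the number of sendmsg
	 * calls covered by the notification and stores the inclusive
	 * counter range in *lo and *hi. Returns 0 if serr is not a
	 * successful zerocopy completion.
	 */
	static uint32_t zc_completed(const struct sock_extended_err *serr,
				     uint32_t *lo, uint32_t *hi)
	{
		assert(serr && lo && hi);

		if (serr->ee_errno != 0 ||
		    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			return 0;

		*lo = serr->ee_info;
		*hi = serr->ee_data;

		/* Unsigned arithmetic keeps this correct across wraparound. */
		return *hi - *lo + 1;
	}

An application that numbers its own MSG_ZEROCOPY sends from zero can treat
every send whose index falls within ``[lo, hi]`` as complete and reuse the
corresponding dmabuf regions.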


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/drivers/net/hw/ncdevmem.c``.

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, you need to run it as a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem also has a validation mode that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server with::

	ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
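
The sender command above produces the byte sequence 01 02 03 04 05 06 00,
repeated. The sketch below shows one way a receiver could check such a stream
across several reads. It is not ncdevmem's own validation code:
``check_pattern()`` is a hypothetical helper, and reading ``-v 7`` as "the
pattern repeats every 7 bytes" is an inference from the sender command::

	#include <assert.h>
	#include <stddef.h>

	/*
	 * Check buf against a counter that wraps to 0 at period.
	 * *next holds the next expected byte value and is carried
	 * across calls; start it at 1 for the stream shown above.
	 * Returns len if all bytes match, else the index of the
	 * first mismatch.
	 */
	static size_t check_pattern(const unsigned char *buf, size_t len,
				    unsigned int period, unsigned int *next)
	{
		size_t i;

		assert(buf && next && period > 0);

		for (i = 0; i < len; i++) {
			if (buf[i] != *next)
				return i;
			*next = (*next + 1) % period;
		}

		return len;
	}

Keeping the expected value in ``*next`` lets the check continue seamlessly
when a pattern period is split across two received frags.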
420