.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications, which transfer large amounts of
  data with remote SSDs. Much of this data does not require host processing.

Typically, device-to-device data transfers over the network are implemented as
the following low-level operations: a device-to-host copy, a host-to-host
network transfer, and a host-to-device copy.

The flow involving host copies is suboptimal, especially for bulk data
transfers, and can put significant strain on system resources such as host
memory bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


RX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
---------

Header split, flow steering, and RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering and RSS are used to ensure that only flows targeting devmem land
on an RX queue bound to devmem.

Enable header split and flow steering::

	# enable header split
	ethtool -G eth1 tcp-data-split on

	# enable flow steering
	ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example). ``equal 15`` spreads RSS traffic evenly across queues 0-14 only,
so queue 15 receives only explicitly steered flows::

	ethtool --set-rxfh-indir eth1 equal 15
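
The round-robin fill behind ``equal N`` can be modelled in a few lines of C.
This is a sketch of the indirection-table rule only, not ethtool or driver
code; ``fill_equal_indir()`` and ``indir_uses_queue()`` are hypothetical helper
names, and the table size is whatever the NIC reports.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Model of an "equal N" RSS indirection table: entry i maps to queue
 * i % N, so only queues 0..N-1 receive hashed traffic.
 * Hypothetical helper, not an ethtool API. Does nothing if nqueues is 0.
 */
static void fill_equal_indir(uint32_t *table, size_t entries, uint32_t nqueues)
{
	if (nqueues == 0)
		return;

	for (size_t i = 0; i < entries; i++)
		table[i] = (uint32_t)(i % nqueues);
}

/* Return 1 if any indirection entry points at the given queue, else 0. */
static int indir_uses_queue(const uint32_t *table, size_t entries,
			    uint32_t queue)
{
	for (size_t i = 0; i < entries; i++)
		if (table[i] == queue)
			return 1;
	return 0;
}
```

With ``nqueues`` set to 15, no entry refers to queue 15, which is why that
queue is left free for flows steered to it explicitly.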


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

	/* Bind dmabuf to NIC RX queue 15 */
	struct netdev_bind_rx_req *req = NULL;
	struct netdev_bind_rx_rsp *rsp = NULL;
	struct netdev_queue_id *queues;
	struct ynl_error yerr;

	queues = netdev_queue_id_alloc(1);
	netdev_queue_id_set_type(&queues[0], NETDEV_QUEUE_TYPE_RX);
	netdev_queue_id_set_id(&queues[0], 15);

	/* ys is a struct ynl_sock ** owned by the caller */
	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
	netdev_bind_rx_req_set_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, 1);

	rsp = netdev_bind_rx(*ys, req);

	dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. This ensures that the binding is released
automatically even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket's flow must be steered to the dmabuf-bound RX queue::

	ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals this to the user via the SCM_DEVMEM_* cmsgs::

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;

		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			/* Frag landed in dmabuf.
			 *
			 * dmabuf_cmsg->dmabuf_id is the dmabuf the
			 * frag landed on.
			 *
			 * dmabuf_cmsg->frag_offset is the offset into
			 * the dmabuf where the frag starts.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 *
			 * dmabuf_cmsg->frag_token is a token used to
			 * refer to this frag for later freeing.
			 */

			struct dmabuf_token token;
			token.token_start = dmabuf_cmsg->frag_token;
			token.token_count = 1;
			continue;
		}

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR) {
			/* Frag landed in linear buffer.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 */
			continue;
		}
	}

Applications may receive two types of cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.
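
The three outcomes above (dmabuf frags, linear frags, or no devmem cmsgs at
all) can be told apart by tallying the cmsgs of one recvmsg call. The sketch
below makes assumptions: ``struct devmem_frag`` is a local mirror whose layout
is assumed to match ``struct dmabuf_cmsg``, the ``SCM_DEVMEM_*`` fallback
values are the asm-generic ones for older headers, and
``devmem_tally_cmsgs()`` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_DEVMEM_LINEAR
#define SCM_DEVMEM_LINEAR 78 /* assumed asm-generic fallback */
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79 /* assumed asm-generic fallback */
#endif

/* Local mirror of struct dmabuf_cmsg; the layout is an assumption. */
struct devmem_frag {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
	uint32_t dmabuf_id;
	uint32_t flags;
};

struct devmem_tally {
	size_t dmabuf_bytes;	/* payload that landed in the dmabuf */
	size_t linear_bytes;	/* payload that landed in host memory */
	unsigned int cmsgs;	/* 0 means regular, non-devmem TCP data */
};

/* Sum the frag sizes reported by the devmem cmsgs attached to msg. */
static struct devmem_tally devmem_tally_cmsgs(struct msghdr *msg)
{
	struct devmem_tally t = { 0 };
	struct devmem_frag frag;
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			continue;	/* too short to hold a frag */

		/* memcpy avoids unaligned access into the control buffer */
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF)
			t.dmabuf_bytes += frag.frag_size;
		else
			t.linear_bytes += frag.frag_size;
		t.cmsgs++;
	}
	return t;
}
```

A result with ``cmsgs == 0`` corresponds to the regular TCP case described
above.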


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
			 sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.

The user must pass no more than 128 tokens per call, and the token_count
values summed across all the tokens must total no more than 1024 frags. If the
user provides more than 1024 frags, the kernel will free up to 1024 frags and
return early.

The kernel returns the number of actual frags freed. The number of frags freed
can be less than the number of frags covered by the tokens the user provided
in case of:

(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
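
One way to stay within these limits is to coalesce consecutive frag tokens
into (token_start, token_count) ranges before each call. The sketch below is
illustrative only: ``struct devmem_token`` is a local mirror assumed to match
``struct dmabuf_token``, ``devmem_batch_tokens()`` is a hypothetical helper,
and nothing guarantees that the kernel hands out consecutive tokens.

```c
#include <stddef.h>
#include <stdint.h>

#define DEVMEM_MAX_TOKENS 128	/* per-call token limit stated above */
#define DEVMEM_MAX_FRAGS  1024	/* per-call frag limit stated above */

/* Local mirror of struct dmabuf_token; the layout is an assumption. */
struct devmem_token {
	uint32_t token_start;
	uint32_t token_count;
};

/*
 * Build the argument for one SO_DEVMEM_DONTNEED call from n frag tokens.
 * out must have room for DEVMEM_MAX_TOKENS entries. Returns the number
 * of ranges written; *consumed is the number of input tokens covered,
 * so the caller can submit the remainder in a later call.
 */
static size_t devmem_batch_tokens(const uint32_t *frag_tokens, size_t n,
				  struct devmem_token *out, size_t *consumed)
{
	size_t nout = 0, i = 0;

	while (i < n && i < DEVMEM_MAX_FRAGS) {
		struct devmem_token *last = nout ? &out[nout - 1] : NULL;

		if (last &&
		    frag_tokens[i] == last->token_start + last->token_count) {
			last->token_count++;	/* extends the current range */
		} else {
			if (nout == DEVMEM_MAX_TOKENS)
				break;		/* no room for a new range */
			out[nout].token_start = frag_tokens[i];
			out[nout].token_count = 1;
			nout++;
		}
		i++;
	}

	*consumed = i;
	return nout;
}
```

The caller would pass ``out`` with a length of ``nout * sizeof(*out)`` to
setsockopt and compare the returned frag count against ``*consumed`` to detect
the short-free cases listed above.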

TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

	struct netdev_bind_tx_req *req = NULL;
	struct netdev_bind_tx_rsp *rsp = NULL;
	struct ynl_error yerr;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_tx_req_alloc();
	netdev_bind_tx_req_set_ifindex(req, ifindex);
	netdev_bind_tx_req_set_fd(req, dmabuf_fd);

	rsp = netdev_bind_tx(*ys, req);

	tx_dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. This ensures that the binding is released
automatically even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.

Socket Setup
------------

The user application must use the MSG_ZEROCOPY flag when sending devmem TCP.
Devmem cannot be copied by the kernel, so the semantics of devmem TX are
similar to the semantics of MSG_ZEROCOPY. As with MSG_ZEROCOPY, SO_ZEROCOPY
must first be enabled on the socket::

	int opt = 1;

	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user use SO_BINDTODEVICE to bind the TX socket
to the same interface the dma-buf has been bound to::

	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the dma-buf id to send from as a u32 cmsg payload.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

	char ctrl_data[CMSG_SPACE(sizeof(__u32))];
	struct msghdr msg = {};
	struct cmsghdr *cmsg;
	struct iovec iov[2];

	/* iov_base carries an offset into the dmabuf, not a pointer */
	iov[0].iov_base = (void*)100;
	iov[0].iov_len = 1024;
	iov[1].iov_base = (void*)2000;
	iov[1].iov_len = 2048;

	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	msg.msg_control = ctrl_data;
	msg.msg_controllen = sizeof(ctrl_data);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));

	*((__u32 *)CMSG_DATA(cmsg)) = tx_dmabuf_id;

	sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
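
Because ``iov_base`` holds an offset rather than a pointer, an application may
want to check each segment against the dmabuf size before sending. A minimal
sketch, assuming the application tracks the size of the dmabuf it bound;
``devmem_tx_iov()`` is a hypothetical helper, not a kernel API.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Describe one devmem TX segment. For devmem sends, iov_base carries a
 * byte offset into the bound dmabuf instead of a host pointer.
 * Returns 0 on success, or -1 if the segment would run past the end of
 * a dmabuf of dmabuf_size bytes (iov is left untouched in that case).
 */
static int devmem_tx_iov(struct iovec *iov, size_t dmabuf_size,
			 size_t offset, size_t len)
{
	/* written to avoid overflow in offset + len */
	if (offset > dmabuf_size || len > dmabuf_size - offset)
		return -1;

	iov->iov_base = (void *)(uintptr_t)offset;
	iov->iov_len = len;
	return 0;
}
```

For a 4096-byte dmabuf, the two segments from the example above (offset 100,
length 1024 and offset 2000, length 2048) both pass this check.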


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

	int64_t tstop = gettimeofday_ms() + waittime_ms;
	char control[CMSG_SPACE(100)] = {};
	struct sock_extended_err *serr;
	struct msghdr msg = {};
	struct cmsghdr *cm;
	__u32 hi, lo;
	int ret;

	msg.msg_control = control;

	/* gettimeofday_ms() and do_poll() are helpers defined in ncdevmem.c */
	while (gettimeofday_ms() < tstop) {
		if (!do_poll(fd))
			continue;

		/* recvmsg() overwrites msg_controllen, so reset it each time */
		msg.msg_controllen = sizeof(control);

		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
		if (ret < 0)
			continue;

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (void *)CMSG_DATA(cm);

			/* completed sends are numbered lo through hi */
			hi = serr->ee_data;
			lo = serr->ee_info;

			fprintf(stdout, "tx complete [%u,%u]\n", lo, hi);
		}
	}

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/drivers/net/hw/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, run it as the server on the machine under test, and run
netcat on a peer to provide the TX data.

ncdevmem also has a validation mode that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server with::

	ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
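
The sender pipeline above emits the bytes 1 through 6 followed by a 0, over
and over, which gives a period of 7. A check for such a stream might look like
the sketch below. This is not the ncdevmem implementation;
``validate_pattern()`` is a hypothetical helper, and the meaning of ``period``
is inferred from the example commands.

```c
#include <stddef.h>

/*
 * Check buf against the repeating pattern 1, 2, ..., period - 1, 0.
 * *pos is the stream position carried across calls, so the stream may
 * be validated in arbitrary chunks. Returns -1 if every byte matches,
 * or the index in buf of the first mismatch. A period of 0 is rejected
 * by reporting a mismatch at index 0.
 */
static long validate_pattern(const unsigned char *buf, size_t len,
			     unsigned int period, size_t *pos)
{
	if (period == 0)
		return 0;

	for (size_t i = 0; i < len; i++) {
		unsigned char expect = (unsigned char)((*pos + 1) % period);

		if (buf[i] != expect)
			return (long)i;
		(*pos)++;
	}
	return -1;
}
```

Carrying ``*pos`` between calls matters because TCP delivers the stream in
chunks whose boundaries do not line up with the pattern.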
415