.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically the Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.



More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


RX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

	# enable header split
	ethtool -G eth1 tcp-data-split on


	# enable flow steering
	ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

	ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

	/* Bind dmabuf to NIC RX queue 15 */
	struct netdev_queue_id *queues;

	queues = netdev_queue_id_alloc(1);
	netdev_queue_id_set_type(&queues[0], NETDEV_QUEUE_TYPE_RX);
	netdev_queue_id_set_id(&queues[0], 15);

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
	netdev_bind_rx_req_set_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, 1);

	rsp = netdev_bind_rx(*ys, req);

	dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

	ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
			(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
			 cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;

		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			/* Frag landed in dmabuf.
			 *
			 * dmabuf_cmsg->dmabuf_id is the dmabuf the
			 * frag landed on.
			 *
			 * dmabuf_cmsg->frag_offset is the offset into
			 * the dmabuf where the frag starts.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 *
			 * dmabuf_cmsg->frag_token is a token used to
			 * refer to this frag for later freeing.
			 */

			struct dmabuf_token token;
			token.token_start = dmabuf_cmsg->frag_token;
			token.token_count = 1;
			continue;
		}

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
			/* Frag landed in linear buffer.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 */
			continue;

	}

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
			 sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.

The user must pass no more than 128 tokens, with no more than 1024 total frags
among the token->token_count across all the tokens. If the user provides more
than 1024 frags, the kernel will free up to 1024 frags and return early.

The kernel returns the number of actual frags freed. The number of frags freed
can be less than the tokens provided by the user in case of:

(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
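
The per-call limits above suggest batching frag tokens before returning them.
The following is a minimal sketch of one way to do that, not code taken from
the kernel or from ncdevmem. ``struct frag_token_range`` and
``pack_frag_tokens()`` are hypothetical names; the struct only mirrors the
``token_start`` and ``token_count`` fields used earlier in this document, and
the sketch assumes that a range with ``token_count`` N covers the N consecutive
token values starting at ``token_start``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limits quoted in the text above for one SO_DEVMEM_DONTNEED call. */
#define MAX_TOKENS_PER_CALL 128
#define MAX_FRAGS_PER_CALL  1024

/* Hypothetical local mirror of the two struct dmabuf_token fields used in
 * this document; real code should use the kernel's own definition.
 */
struct frag_token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/* Pack frag tokens into (start, count) ranges for a single call.
 *
 * Consecutive token values are folded into one range. Packing stops once
 * either per-call limit would be exceeded. "out" must have room for
 * MAX_TOKENS_PER_CALL entries. Returns the number of ranges written;
 * *consumed is set to the number of input tokens covered, so the caller
 * can submit the remainder in a later call.
 */
static size_t pack_frag_tokens(const uint32_t *tokens, size_t n_tokens,
			       struct frag_token_range *out, size_t *consumed)
{
	size_t n_ranges = 0;
	size_t i;

	assert(tokens && out && consumed);

	for (i = 0; i < n_tokens && i < MAX_FRAGS_PER_CALL; i++) {
		struct frag_token_range *last =
			n_ranges ? &out[n_ranges - 1] : NULL;

		/* Extend the previous range if this token follows it. */
		if (last && tokens[i] == last->token_start + last->token_count) {
			last->token_count++;
			continue;
		}

		/* A new range is needed; stop if the token limit is hit. */
		if (n_ranges == MAX_TOKENS_PER_CALL)
			break;

		out[n_ranges].token_start = tokens[i];
		out[n_ranges].token_count = 1;
		n_ranges++;
	}

	*consumed = i;
	return n_ranges;
}
```

With the real ``struct dmabuf_token``, the packed array would presumably be
handed to ``setsockopt()`` with an option length of the range count times
``sizeof(struct dmabuf_token)``, and the return value compared against the
number of frags submitted.
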

TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

	struct netdev_bind_tx_req *req = NULL;
	struct netdev_bind_tx_rsp *rsp = NULL;
	struct ynl_error yerr;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_tx_req_alloc();
	netdev_bind_tx_req_set_ifindex(req, ifindex);
	netdev_bind_tx_req_set_fd(req, dmabuf_fd);

	rsp = netdev_bind_tx(*ys, req);

	tx_dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.

Socket Setup
------------

The user application must use MSG_ZEROCOPY flag when sending devmem TCP. Devmem
cannot be copied by the kernel, so the semantics of the devmem TX are similar
to the semantics of MSG_ZEROCOPY::

	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user binds the TX socket to the same interface
the dma-buf has been bound to via SO_BINDTODEVICE::

	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where,

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the dma-buf id to send from as a u32 cmsg payload.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
from offset 2000 into the dmabuf. The dmabuf to send from is tx_dmabuf_id::

	char ctrl_data[CMSG_SPACE(sizeof(__u32))];
	struct msghdr msg = {};
	struct cmsghdr *cmsg;
	struct iovec iov[2];

	iov[0].iov_base = (void*)100;
	iov[0].iov_len = 1024;
	iov[1].iov_base = (void*)2000;
	iov[1].iov_len = 2048;

	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	msg.msg_control = ctrl_data;
	msg.msg_controllen = sizeof(ctrl_data);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));

	*((__u32 *)CMSG_DATA(cmsg)) = tx_dmabuf_id;

	sendmsg(socket_fd, &msg, MSG_ZEROCOPY);


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

	int64_t tstop = gettimeofday_ms() + waittime_ms;
	char control[CMSG_SPACE(100)] = {};
	struct sock_extended_err *serr;
	struct msghdr msg = {};
	struct cmsghdr *cm;
	int retries = 10;
	__u32 hi, lo;

	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	while (gettimeofday_ms() < tstop) {
		if (!do_poll(fd)) continue;

		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (void *)CMSG_DATA(cm);

			hi = serr->ee_data;
			lo = serr->ee_info;

			fprintf(stdout, "tx complete [%d,%d]\n", lo, hi);
		}
	}

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- TCP Dump and bpf can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/drivers/net/hw/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, you need to run it on a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem has a validation mode as well that expects a repeating pattern of
incoming data and validates it as such.
For example, you can launch
ncdevmem on the server by::

	ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7

On client side, use regular netcat to send TX data to ncdevmem process
on the server::

	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
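
The client command above produces the bytes 0x01 to 0x06 followed by a NUL,
repeated, which is a pattern with a period of 7 and matches the ``-v 7``
argument. As an illustration of what validating such a stream involves, the
sketch below checks received chunks against that pattern while carrying the
position across calls. It is not ncdevmem's actual implementation;
``struct pattern_state``, ``pattern_init()`` and ``pattern_check()`` are
hypothetical names, and the sketch assumes the data has already been copied
into host-readable memory.

```c
#include <assert.h>
#include <stddef.h>

/* Running position in the repeating pattern, kept across calls so that a
 * stream can be checked chunk by chunk.
 */
struct pattern_state {
	unsigned char next;	/* next expected byte */
	unsigned char period;	/* pattern length, e.g. 7 */
};

/* Start at byte value 1, as the client command above does. */
static void pattern_init(struct pattern_state *st, unsigned char period)
{
	assert(st && period > 1);
	st->next = 1;
	st->period = period;
}

/* Returns the index of the first mismatching byte, or len if the whole
 * chunk matches the sequence 1, 2, ..., period - 1, 0, 1, 2, ...
 */
static size_t pattern_check(struct pattern_state *st,
			    const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] != st->next)
			return i;
		st->next = (unsigned char)((st->next + 1) % st->period);
	}

	return len;
}
```

A receiver would call ``pattern_init()`` once per connection and
``pattern_check()`` on each received chunk, treating any return value smaller
than the chunk length as a validation failure.
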