============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.


Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the
tail of the queue. If so, it drops the new notification packet and
instead increases the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

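Because notifications carry inclusive ranges of a wrapping 32-bit
counter and may arrive out of order, a process needs some bookkeeping
to know which buffers it may reuse. The following is a minimal sketch
of one way to do that, not code from the kernel tree. The names
zc_tracker, zc_on_send, zc_on_complete and zc_range_count, and the
ZC_RING limit, are hypothetical. It assumes that the process mirrors
the kernel counter itself (seeding next_id with the value the next
successful send will get, taken here to be 0 on a fresh socket) and
that fewer than ZC_RING sends are ever in flight.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on in-flight zerocopy sends. A power of two keeps
 * the id-to-slot mapping consistent when the 32-bit counter wraps. */
#define ZC_RING 64

struct zc_tracker {
	uint32_t next_id;		/* id the next successful send will get */
	uint32_t inflight;		/* sends whose buffers are still shared */
	unsigned char busy[ZC_RING];	/* busy[id % ZC_RING] != 0: do not reuse */
};

/* Number of send calls covered by the inclusive range [lo, hi].
 * Unsigned arithmetic stays correct when the range spans the wrap. */
static uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
	return hi - lo + 1;
}

/* Record one successful send with MSG_ZEROCOPY; returns its id. */
static uint32_t zc_on_send(struct zc_tracker *t)
{
	uint32_t id = t->next_id++;

	assert(t->inflight < ZC_RING);
	t->busy[id % ZC_RING] = 1;
	t->inflight++;
	return id;
}

/* Apply one notification range [lo, hi]; returns the number of buffers
 * released. Each id is marked individually, so out-of-order ranges are
 * handled without extra work. */
static uint32_t zc_on_complete(struct zc_tracker *t, uint32_t lo, uint32_t hi)
{
	uint32_t n = zc_range_count(lo, hi);
	uint32_t released = 0;

	for (uint32_t i = 0; i < n; i++) {
		uint32_t id = lo + i;	/* wraps naturally past UINT32_MAX */

		if (t->busy[id % ZC_RING]) {
			t->busy[id % ZC_RING] = 0;
			t->inflight--;
			released++;
		}
	}
	return released;
}
```

A caller would invoke zc_on_send() after every send that succeeds with
a non-zero length, and zc_on_complete() with the two range bounds of
each notification read from the error queue. A buffer is safe to reuse
once its slot in busy[] is clear.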


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* reject the message unless both level and type match */
	cm = CMSG_FIRSTHDR(msg);
	if (!cm ||
	    cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	/* inclusive range of completed send calls */
	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.

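When experimenting with the feature, the guidance given under "Mixing
copy avoidance and copying" and "Deferred copies" can be combined into
a small per-socket policy. The sketch below shows one possible shape
and is not taken from the selftest. The struct zc_policy, the two
zc_policy_* helpers and the ZC_MIN_LEN cutoff are hypothetical names,
and the cutoff merely echoes the rough 10 KB figure mentioned earlier.
It should be tuned by measurement. The fallback definitions mirror the
Linux uapi values for toolchains whose headers do not expose them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Fallbacks for older headers; values as in the Linux uapi. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif
#ifndef SO_EE_CODE_ZEROCOPY_COPIED
#define SO_EE_CODE_ZEROCOPY_COPIED 1
#endif

/* Hypothetical cutoff below which copying is assumed to be cheaper. */
#define ZC_MIN_LEN (10 * 1024)

struct zc_policy {
	bool copied_seen;	/* a completion reported a deferred copy */
};

/* Feed the ee_code field of each zerocopy completion into the policy. */
static void zc_policy_on_completion(struct zc_policy *p, uint8_t ee_code)
{
	assert(p != NULL);
	if (ee_code & SO_EE_CODE_ZEROCOPY_COPIED)
		p->copied_seen = true;
}

/* Flags for the next send call: request zerocopy only for large writes
 * on a socket that has not yet reported a deferred copy. */
static int zc_policy_send_flags(const struct zc_policy *p, size_t len)
{
	assert(p != NULL);
	if (p->copied_seen || len < ZC_MIN_LEN)
		return 0;
	return MSG_ZEROCOPY;
}
```

The value returned by zc_policy_send_flags() can be OR-ed into the
flags argument of send. Once a deferred copy has been reported, this
sketch never re-enables zerocopy on the socket. That is a deliberately
conservative choice, and a real application might retry after a while.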