.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd January 4, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane.
.Pp
.Nm
and
.Nm VALE
are one order of magnitude faster than sockets, bpf or
native switches based on
.Xr tun/tap 4 ,
reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
and 20 Mpps per core for VALE ports.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
A selectable file descriptor supports
synchronization and blocking I/O.
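.Pp
The 14.88 Mpps figure quoted above matches the line rate of a 10 Gbit/s
Ethernet link carrying minimum-size (64 byte) frames.
The following is a minimal sketch of that arithmetic, assuming the usual
Ethernet wire overhead of 20 bytes per frame (8 bytes of preamble and
start delimiter plus a 12 byte inter-frame gap); the helper name
line_rate_pps is hypothetical and is not part of the netmap API.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Assumed wire overhead per frame: 8 bytes preamble/SFD + 12 bytes gap. */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Hypothetical helper (not part of netmap): maximum number of frames
 * per second on a link of link_bps bits/s, for frames of frame_len
 * bytes (CRC included).
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_len)
{
    assert(frame_len > 0);
    return link_bps / (8ull * (frame_len + ETH_WIRE_OVERHEAD));
}
```
.Ed
With a link speed of 10000000000 bits/s and 64 byte frames this yields
14880952 frames per second, i.e. roughly 14.88 Mpps.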
.Pp
Similarly,
.Nm VALE
can dynamically create switch instances and ports,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
For best performance,
.Nm
requires explicit support in device drivers;
however, the
.Nm
API can be emulated on top of unmodified device drivers,
at the price of reduced performance
(but still better than sockets or BPF/pcap).
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Pp
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
.Nm VALE
ports instead use separate memory regions.
.Pp
.Sh ENTERING AND EXITING NETMAP MODE
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Pp
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;          /* properties */
    ...
    const uint32_t ni_tx_rings;       /* NIC tx rings */
    const uint32_t ni_rx_rings;       /* NIC rx rings */
    const uint32_t ni_extra_tx_rings; /* extra tx rings */
    const uint32_t ni_extra_rx_rings; /* extra rx rings */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can request additional tx/rx rings,
to be used between multiple processes/threads
accessing the same
.Nm
port.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring */
    const uint32_t nr_buf_size; /* size of each buffer */
    ...
    uint32_t       head;        /* (u) first buf owned by user */
    uint32_t       cur;         /* (u) wakeup position */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync() */
    ...
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.Pp
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* buf changed, etc. */
    uint64_t ptr;       /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in <net/netmap_user.h>
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
that are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Pp
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
         head=cur     tail
             |          |
             v          v
    TX [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
              head=cur tail
                  |     |
                  v     v
    TX [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
              head=cur      tail
                  |          |
                  v          v
    TX [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
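.Pp
The index arithmetic described above can be sketched on a simplified
model of a transmit ring.
The structure and the sk_ functions below are hypothetical stand-ins
written for illustration only; they are not the definitions from
<net/netmap.h>, and a real client would also fill each slot's buffer
and length before advancing
.Va head .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct netmap_ring: index fields only. */
struct sk_ring {
    uint32_t num_slots;  /* slots in the ring */
    uint32_t head;       /* (u) first slot owned by user */
    uint32_t cur;        /* (u) wakeup position */
    uint32_t tail;       /* (k) first slot owned by kernel */
};

/* Next index modulo the ring size; no power-of-two assumption. */
static uint32_t
sk_ring_next(const struct sk_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range head ... tail-1. */
static uint32_t
sk_ring_space(const struct sk_ring *r)
{
    return (r->tail >= r->head) ? r->tail - r->head
        : r->tail + r->num_slots - r->head;
}

/*
 * Claim up to want slots for transmission and advance head and cur
 * past them. Returns the number of slots actually claimed.
 */
static uint32_t
sk_tx_claim(struct sk_ring *r, uint32_t want)
{
    uint32_t n = sk_ring_space(r);

    if (n > want)
        n = want;
    for (uint32_t k = 0; k < n; k++)
        r->head = sk_ring_next(r, r->head);
    r->cur = r->head;
    return n;
}
```
.Ed
Starting from the first diagram (30 slots, head and cur at 5, tail at 16),
claiming 5 slots moves head and cur to 10 and leaves 6 slots available,
as in the second diagram.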
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Pp
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
         head  cur     tail
          |     |       |
          v     v       v
    RX [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
             head cur  tail
               |  |     |
               v  v     v
    RX [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
             head cur        tail
               |  |           |
               v  v           v
    RX [.......hhhRRRRRRRRRRRR....]
.Ed
.Pp
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.Pp
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; no field holds the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name */
    uint32_t  nr_version;        /* (i) API version */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region */
    uint32_t  nr_tx_slots;       /* (o) slots in tx rings */
    uint32_t  nr_rx_slots;       /* (o) slots in rx rings */
    uint16_t  nr_tx_rings;       /* (o) number of tx rings */
    uint16_t  nr_rx_rings;       /* (o) number of rx rings */
    uint16_t  nr_ringid;         /* (i) ring(s) we care about */
    uint16_t  nr_cmd;            /* (i) special command */
    uint16_t  nr_arg1;           /* (i) extra arguments */
    uint16_t  nr_arg2;           /* (i) extra arguments */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Pp
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_ringid
selects which rings are controlled through this file descriptor.
Possible values are:
.Bl -tag -width XXXXX
.It 0
(default) all hardware rings
.It NETMAP_SW_RING
the ``host rings'', connecting to the host stack.
.It NETMAP_HW_RING | i
the i-th hardware ring.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_SYNC
to the value written to
.Va nr_ringid .
When NETMAP_NO_TX_SYNC is set,
packets are transmitted only on
.Va ioctl(NIOCTXSYNC)
or when select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT AND POLL
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS
when write (POLLOUT) and read (POLLIN) events are requested.
.Pp
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
.Pp
Packets in transmit rings are normally pushed out even without
requesting write events.
Passing the NETMAP_NO_TX_SYNC flag to
.Em NIOCREGIF
disables this feature.
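.Pp
The blocking condition and the receive-side processing described in
.Sx RECEIVE RINGS
can be sketched on a simplified model of a receive ring.
The structure, the SK_SLOTS constant and the sk_ functions are
hypothetical and exist only for this illustration; a real client would
obtain the ring with NETMAP_RXRING() and call poll() or
ioctl(NIOCRXSYNC) between rounds.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define SK_SLOTS 8  /* illustrative ring size, deliberately small */

/* Simplified stand-in for a receive ring and its slots. */
struct sk_rx_ring {
    uint32_t head, cur, tail;  /* same roles as in struct netmap_ring */
    uint16_t len[SK_SLOTS];    /* per-slot packet length */
};

/* select()/poll() would block in this state: cur has reached tail. */
static int
sk_would_block(const struct sk_rx_ring *r)
{
    return r->cur == r->tail;
}

/*
 * Consume every slot in head ... tail-1 and return the total number
 * of bytes seen, then release all of them by advancing head and cur.
 */
static uint32_t
sk_rx_drain(struct sk_rx_ring *r)
{
    uint32_t bytes = 0;

    assert(r->head < SK_SLOTS && r->tail < SK_SLOTS);
    while (r->head != r->tail) {
        bytes += r->len[r->head];
        r->head = (r->head + 1 == SK_SLOTS) ? 0 : r->head + 1;
    }
    r->cur = r->head;
    return bytes;
}
```
.Ed
After sk_rx_drain() returns, head and cur equal tail, which is exactly
the state in which the next select() or poll() would block until the
kernel reports new packets.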
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va flags
can be set to
.Va NETMAP_SW_RING
to bind to the host ring pair,
or to NETMAP_HW_RING to bind to a specific ring.
.Va ring_name
with NETMAP_HW_RING,
is interpreted as a string or an integer indicating the ring to use.
.It Va ring_flags
is copied directly into the ring flags, to specify additional parameters
such as NR_TIMESTAMP or NR_FORWARD.
.El
.It Va int nm_close(struct nm_desc_t *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
similar to pcap_next(), fetches the next packet.
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Pp
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native mode and
fails if it is not available, 2 forces emulated mode and hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally give good performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
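.Pp
As a rough guide to what line rate means for small packets,
the following sketch computes the maximum frame rate of an Ethernet link.
It assumes standard Ethernet framing, where every frame on the wire is
accompanied by 8 bytes of preamble and 12 bytes of inter-frame gap;
.Va line_rate_pps()
is a hypothetical helper written for this example and is not part of
.Nm .

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 8 bytes preamble/SFD + 12 bytes inter-frame gap. */
#define ETH_OVERHEAD_BYTES 20

/*
 * Maximum frames per second on a link of link_bps bits per second,
 * for frames of frame_len bytes (length including the 4-byte CRC).
 * The result is truncated to a whole number of frames.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_len)
{
    assert(frame_len >= 64); /* minimum Ethernet frame size */
    return link_bps / ((uint64_t)(frame_len + ETH_OVERHEAD_BYTES) * 8);
}
```

For a 10 Gbit/s link and minimum-size (64-byte) frames this gives
roughly 14.88 million packets per second, which is the rate the NIC,
the PCIe bus and memory all have to sustain.
.Pp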
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high-speed
experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or the
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even to connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;        /* base of the mapped netmap memory region */
    char *buf;      /* buffer of the slot being filled */
    uint32_t i;     /* index of the slot being filled */
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);      /* wait for free transmit slots */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc_t *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_hdr_t h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);      /* wait for incoming packets */
        while ((buf = nm_nextpkt(d, &h)) != NULL)
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];     /* rxr: receive ring */
    dst = &txr->slot[txr->cur];     /* txr: transmit ring */
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g.
running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Pp
.Ss SPECIAL MODES
When the device name has the form
.Dl valeXXX:ifname (ifname is an existing interface)
the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the
.Nm VALE
switch named XXX.
In this case the
.Pa ioctl()
is used only for configuration, typically through the
.Xr vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
closed after issuing the
.Pa ioctl() .