1*ce3ee1e7SLuigi Rizzo.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa 268b8534bSLuigi Rizzo.\" All rights reserved. 368b8534bSLuigi Rizzo.\" 468b8534bSLuigi Rizzo.\" Redistribution and use in source and binary forms, with or without 568b8534bSLuigi Rizzo.\" modification, are permitted provided that the following conditions 668b8534bSLuigi Rizzo.\" are met: 768b8534bSLuigi Rizzo.\" 1. Redistributions of source code must retain the above copyright 868b8534bSLuigi Rizzo.\" notice, this list of conditions and the following disclaimer. 968b8534bSLuigi Rizzo.\" 2. Redistributions in binary form must reproduce the above copyright 1068b8534bSLuigi Rizzo.\" notice, this list of conditions and the following disclaimer in the 1168b8534bSLuigi Rizzo.\" documentation and/or other materials provided with the distribution. 1268b8534bSLuigi Rizzo.\" 1368b8534bSLuigi Rizzo.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 1468b8534bSLuigi Rizzo.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 1568b8534bSLuigi Rizzo.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 1668b8534bSLuigi Rizzo.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 1768b8534bSLuigi Rizzo.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 1868b8534bSLuigi Rizzo.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 1968b8534bSLuigi Rizzo.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 2068b8534bSLuigi Rizzo.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 2168b8534bSLuigi Rizzo.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 2268b8534bSLuigi Rizzo.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 2368b8534bSLuigi Rizzo.\" SUCH DAMAGE. 
2468b8534bSLuigi Rizzo.\" 2568b8534bSLuigi Rizzo.\" This document is derived in part from the enet man page (enet.4) 2668b8534bSLuigi Rizzo.\" distributed with 4.3BSD Unix. 2768b8534bSLuigi Rizzo.\" 2868b8534bSLuigi Rizzo.\" $FreeBSD$ 2968b8534bSLuigi Rizzo.\" 30*ce3ee1e7SLuigi Rizzo.Dd October 18, 2013 3168b8534bSLuigi Rizzo.Dt NETMAP 4 3268b8534bSLuigi Rizzo.Os 3368b8534bSLuigi Rizzo.Sh NAME 3468b8534bSLuigi Rizzo.Nm netmap 3568b8534bSLuigi Rizzo.Nd a framework for fast packet I/O 3668b8534bSLuigi Rizzo.Sh SYNOPSIS 3768b8534bSLuigi Rizzo.Cd device netmap 3868b8534bSLuigi Rizzo.Sh DESCRIPTION 3968b8534bSLuigi Rizzo.Nm 40*ce3ee1e7SLuigi Rizzois a framework for extremely fast and efficient packet I/O 41*ce3ee1e7SLuigi Rizzo(reaching 14.88 Mpps with a single core at less than 1 GHz) 42*ce3ee1e7SLuigi Rizzofor both userspace and kernel clients. 43*ce3ee1e7SLuigi RizzoUserspace clients can use the netmap API 44*ce3ee1e7SLuigi Rizzoto send and receive raw packets through physical interfaces 45*ce3ee1e7SLuigi Rizzoor ports of the 46*ce3ee1e7SLuigi Rizzo.Xr VALE 4 47*ce3ee1e7SLuigi Rizzoswitch. 48*ce3ee1e7SLuigi Rizzo.Pp 49*ce3ee1e7SLuigi Rizzo.Nm VALE 50*ce3ee1e7SLuigi Rizzois a very fast (reaching 20 Mpps per port) 51*ce3ee1e7SLuigi Rizzoand modular software switch, 52*ce3ee1e7SLuigi Rizzoimplemented within the kernel, which can interconnect 53*ce3ee1e7SLuigi Rizzovirtual ports, physical devices, and the native host stack. 54*ce3ee1e7SLuigi Rizzo.Pp 5568b8534bSLuigi Rizzo.Nm 56*ce3ee1e7SLuigi Rizzouses a memory mapped region to share packet buffers, 57*ce3ee1e7SLuigi Rizzodescriptors and queues with the kernel. 58*ce3ee1e7SLuigi RizzoSimple 59*ce3ee1e7SLuigi Rizzo.Pa ioctl()s 60*ce3ee1e7SLuigi Rizzoare used to bind interfaces/ports to file descriptors and 61*ce3ee1e7SLuigi Rizzoimplement non-blocking I/O, whereas blocking I/O uses 6268b8534bSLuigi Rizzo.Pa select()/poll() . 
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see
.Sx SUPPORTED INTERFACES .
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
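.Pp
The naming rules for
.Em nr_name
listed below can be illustrated with a small, self-contained C sketch.
The enum and the
.Fn classify_nr_name
function are hypothetical helpers, not part of the
.Nm
API.
The sketch assumes that VALE names start with the prefix
"vale" and contain a ':' followed by a port name, and that a name must
fit in the IFNAMSIZ-byte
.Em nr_name
array including its terminating NUL byte.
.Bd -literal
#include <string.h>
#include <net/if.h>     /* IFNAMSIZ */

/* Hypothetical classification of a struct nmreq nr_name string. */
enum nr_name_kind {
    NR_NAME_INVALID,    /* empty, too long, or malformed VALE name */
    NR_NAME_IFACE,      /* a network interface, e.g. "em0" */
    NR_NAME_VALE        /* a VALE name, e.g. "vale0:p1" */
};

enum nr_name_kind
classify_nr_name(const char *name)
{
    size_t len = strlen(name);
    const char *colon;

    /* nr_name is a char[IFNAMSIZ]: leave room for the NUL byte. */
    if (len == 0 || len >= IFNAMSIZ)
        return NR_NAME_INVALID;

    /* Names without the "vale" prefix refer to a network interface. */
    if (strncmp(name, "vale", 4) != 0)
        return NR_NAME_IFACE;

    /* "valeXXX:YYY" needs a ':' followed by a non-empty port name. */
    colon = strchr(name, ':');
    if (colon == NULL || colon[1] == 0)
        return NR_NAME_INVALID;
    return NR_NAME_VALE;
}
.Ed
The sketch does not distinguish the valeXXX:ifname case from
valeXXX:YYY; doing so would also require checking whether the port
part names an existing interface, for instance with
.Xr if_nametoindex 3 .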
.Pp
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The full name cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not already exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected to or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()'s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert them into actual pointers.
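.Pp
The offset arithmetic behind those macros can be sketched as follows.
The structures and functions below are hypothetical, simplified
stand-ins written for illustration only; the real macros are provided by
.In net/netmap_user.h .
The sketch assumes that a ring is located by adding an entry of
.Pa ring_ofs[]
to the address of the interface structure, and that buffer
.Em i
of a ring starts
.Pa buf_ofs
bytes after the ring, plus
.Em i
times the buffer size.
.Bd -literal
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical, simplified stand-ins for netmap_if and netmap_ring. */
struct demo_ring {
    ssize_t  buf_ofs;      /* offset of buffer 0, relative to the ring */
    uint16_t nr_buf_size;  /* size of every buffer, in bytes */
};

struct demo_if {
    uint32_t n_rings;
    ssize_t  ring_ofs[2];  /* ring offsets, relative to this struct */
};

/* Ring i lives ring_ofs[i] bytes after the interface structure. */
static struct demo_ring *
demo_ring_ptr(struct demo_if *nifp, unsigned int i)
{
    return (struct demo_ring *)((char *)nifp + nifp->ring_ofs[i]);
}

/* Buffer idx lives buf_ofs bytes after the ring, idx buffers in. */
static char *
demo_buf_ptr(struct demo_ring *ring, uint32_t idx)
{
    return (char *)ring + ring->buf_ofs + (size_t)idx * ring->nr_buf_size;
}
.Ed
Because only offsets are stored, the same region can be mapped at
different addresses in different processes without invalidating any
of the references it contains.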
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated with the interface.
.Pp
.Pa struct netmap_if
is at offset
.Pa nr_offset
in the shared memory region, where
.Pa nr_offset
is the field of the same name in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
In receive rings, the field 'reserved' tells the kernel how many
slots before 'cur' are still in use by the application
and cannot be released yet (see below).
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one extra receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;     /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* number of usable slots */
    uint32_t       cur;         /* 'current' read/write index */
    uint32_t       reserved;    /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal

              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail -|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel, updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
           |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                      |res|--|<avail (before syscall)
                           ^
                          cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index */
    uint16_t len;               /* packet length */
    uint16_t flags;             /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;               /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated with the slot
should be managed.
.It Dv packet buffers
are normally fixed-size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through macros.
.El
.Pp
Some macros support access to objects in the shared memory
region:
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively.
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
.Pp
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, such notifications are needed
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address of this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap-supplied
buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It ns_ctr
on receive rings, contains the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.\" XXX maybe put it in the ptr field ?
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh IOCTLS
.Nm
supports some ioctl()s to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API        4       /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1 /* attach the NIC */
#define NETMAP_BDG_DETACH       2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3 /* register lookup function */
#define NETMAP_BDG_LIST         4 /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports
the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pa nr_tx_slots, nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings, nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number and the size of the rings may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Nm poll
or
.Nm select
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by OR-ing
.Nm NETMAP_NO_TX_POLL
into nr_ringid.
Normally this feature should be kept unless separate file descriptors
are used for the send and receive rings; without it, packets are
pushed out only when NIOCTXSYNC is issued or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a dynamically created virtual port of a
.Xr vale 4
switch, the desired number of rings (1 by default,
and currently up to 16) can be specified in the
nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives (see
.Xr pthread 3 ) .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
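.Pp
The transmit-side bookkeeping that an application performs between two
.Dv NIOCTXSYNC
calls (fill the slot at 'cur', advance 'cur' with wrap-around,
decrement 'avail') can be simulated without a netmap device.
The structure and functions below are hypothetical and only model the
ring fields described in
.Sx DATA STRUCTURES ;
the wrap-around rule is meant to mirror the
.Fn NETMAP_RING_NEXT
macro used in the example that follows.
.Bd -literal
#include <stdint.h>

#define DEMO_SLOTS 8

/* Hypothetical model of the transmit-ring fields an application updates. */
struct demo_txring {
    uint32_t num_slots;        /* number of slots, at most DEMO_SLOTS */
    uint32_t cur;              /* first slot the application may fill */
    uint32_t avail;            /* number of slots it may fill */
    uint16_t len[DEMO_SLOTS];  /* per-slot packet length */
};

/* Next slot index, wrapping around at the end of the ring. */
static uint32_t
demo_ring_next(const struct demo_txring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/*
 * Queue up to n packets of pktlen bytes each: fill the slot at cur,
 * advance cur, decrement avail.  Returns how many were queued.
 * A real application would then call poll() or ioctl(NIOCTXSYNC).
 */
static uint32_t
demo_fill(struct demo_txring *r, uint32_t n, uint16_t pktlen)
{
    uint32_t queued = 0;

    while (queued < n && r->avail > 0) {
        r->len[r->cur] = pktlen;
        r->cur = demo_ring_next(r, r->cur);
        r->avail--;
        queued++;
    }
    return queued;
}
.Ed
For instance, starting from cur = 6 and avail = 5 in an 8-slot ring,
queuing three packets fills slots 6, 7 and 0, leaving cur = 1 and
avail = 2.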
.Sh EXAMPLES
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

int
main(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    uint16_t len = 60;      /* example frame length (arbitrary) */
    uint32_t i;
    char *buf;
    void *p;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        for ( ; ring->avail > 0 ; ring->avail--) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            /* ... prepare a packet of 'len' bytes in buf ... */
            ring->slot[i].len = len;
            ring->cur = NETMAP_RING_NEXT(ring, i);
        }
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).