.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd March 2, 2017
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
This section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, &arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls.
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
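.Pp
The naming rule for
.Nm VALE
ports given above can be checked in userspace before issuing
.Dv NIOCREGIF .
The following is only a sketch under stated assumptions:
.Fn toy_is_token ,
.Fn toy_vale_name_ok
and
.Dv TOY_IFNAMSIZ
are hypothetical names that are not part of the
.Nm
API, the limit is assumed to be 16 bytes including the terminating NUL,
and the check that PPP does not clash with an existing interface is left out.
.Bd -literal
```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: stand-in for the system's IFNAMSIZ, NUL included. */
#define TOY_IFNAMSIZ 16

/* True if s[0..n) is non-empty and matches [0-9a-zA-Z_]+. */
bool
toy_is_token(const char *s, size_t n)
{
	if (n == 0)
		return false;
	for (size_t i = 0; i < n; i++) {
		char c = s[i];
		bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
		    (c >= 'A' && c <= 'Z') || c == '_';
		if (!ok)
			return false;
	}
	return true;
}

/*
 * Hypothetical helper: true if name looks like valeSSS:PPP.
 * It does not check PPP against existing OS interface names.
 */
bool
toy_vale_name_ok(const char *name)
{
	const char *colon;

	assert(name != NULL);
	if (strlen(name) >= TOY_IFNAMSIZ || strncmp(name, "vale", 4) != 0)
		return false;
	colon = strchr(name, ':');
	if (colon == NULL)
		return false;
	/* SSS sits between "vale" and the colon, PPP after the colon. */
	return toy_is_token(name + 4, (size_t)(colon - (name + 4))) &&
	    toy_is_token(colon + 1, strlen(colon + 1));
}
```
.Ed
.Pp
A caller would copy an accepted name into
.Va arg.nr_name
and leave the final decision to the kernel, which remains the authority
on which names are valid.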
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
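.Pp
Because the ring size need not be a power of two, index arithmetic
has to wrap with a comparison (or a modulo) rather than a bit mask.
The sketch below illustrates this on plain integers;
.Fn toy_ring_next
and
.Fn toy_ring_distance
are hypothetical helpers written for this example, not functions from
.In net/netmap_user.h .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Index of the slot following i on a ring with num_slots slots. */
uint32_t
toy_ring_next(uint32_t i, uint32_t num_slots)
{
	assert(num_slots > 0 && i < num_slots);
	/* Compare and wrap: a mask would only work for powers of two. */
	return (i + 1 == num_slots) ? 0 : i + 1;
}

/* Number of slots in the range from ... to-1, walking forward. */
uint32_t
toy_ring_distance(uint32_t from, uint32_t to, uint32_t num_slots)
{
	assert(from < num_slots && to < num_slots);
	return (to >= from) ? to - from : to + num_slots - from;
}
```
.Ed
.Pp
With one slot always kept empty, the distance between two valid indexes
of the same ring never exceeds
.Va num_slots
minus one.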
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved for the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
which are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
  after the syscall, slots between cur and tail are (a)vailable
           head=cur   tail
           |          |
           v          v
  TX [.....aaaaaaaaaaa.............]

  user creates new packets to (T)ransmit
            head=cur  tail
                |     |
                v     v
  TX [.....TTTTTaaaaaa.............]

  NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur   tail
                |          |
                v          v
  TX [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
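.Pp
The receive-side protocol above can be sketched on a toy ring that mimics
the
.Va head ,
.Va cur
and
.Va tail
fields.
.Vt struct toy_ring ,
.Vt struct toy_slot
and
.Fn toy_rx_drain
are hypothetical names used only for illustration; a real program would
obtain a
.Vt struct netmap_ring
with
.Fn NETMAP_RXRING
and read packet data through
.Fn NETMAP_BUF .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define TOY_SLOTS 6	/* deliberately not a power of two */

struct toy_slot {
	uint16_t len;		/* packet length */
};

struct toy_ring {
	uint32_t num_slots;
	uint32_t head;		/* (u) first slot owned by user */
	uint32_t cur;		/* (u) wakeup position */
	uint32_t tail;		/* (k) first slot owned by kernel */
	struct toy_slot slot[TOY_SLOTS];
};

/*
 * Visit every slot in head ... tail-1, then release them all by
 * advancing head and cur.  Returns the total number of bytes seen.
 */
uint64_t
toy_rx_drain(struct toy_ring *ring)
{
	uint64_t bytes = 0;
	uint32_t i = ring->head;

	assert(ring->num_slots > 0 && ring->num_slots <= TOY_SLOTS);
	while (i != ring->tail) {
		bytes += ring->slot[i].len;
		/* Step forward, wrapping at the end of the ring. */
		i = (i + 1 == ring->num_slots) ? 0 : i + 1;
	}
	ring->head = ring->cur = i;
	return bytes;
}
```
.Ed
.Pp
In this sketch every received slot is released at once.
Code that wants to hold on to some packets would stop advancing
.Va head
earlier, and could still move
.Va cur
ahead to control when the next wakeup happens.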
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
  after the syscall, there are some (h)eld and some (R)eceived slots
        head  cur     tail
        |     |       |
        v     v       v
  RX [..hhhhhhRRRRRRRR..........]

  user advances head and cur, releasing some slots and holding others
           head cur   tail
             |  |     |
             v  v     v
  RX [..*****hhhRRRRRR...........]

  NIOCRXSYNC/poll()/select() recovers slots and reports new packets
           head cur         tail
             |  |           |
             v  v           v
  RX [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use function
.Va nm_open(..)
(see
.Sx LIBRARIES )
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs;
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack;
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings;
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only on
.Va ioctl(NIOCTXSYNC)
or when select()/poll() are called with a write event (POLLOUT/wfdset)
or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the
.Pa nr_tx_rings
and
.Pa nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Dv NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Dv NIOCREGIF
updates receive rings even without read events.
Note that with epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open 3pcap ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr i40e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that if the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected to the host stack as if
they were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel
messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or the
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge 4
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *buf;
    void *p;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    /* map the shared region that holds rings and buffers */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        /* fill every transmit slot that is currently available */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        /* h is a struct, not a pointer, so its length is h.len */
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    /* rxr is the receive ring, txr the transmit ring */
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ...);
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
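.Pp
The buffer swap shown under
.Sx ZERO-COPY FORWARDING
can be tried without a netmap device.
The following is a minimal sketch, not the netmap implementation:
.Vt struct mock_slot ,
.Vt struct mock_ring ,
.Fn mock_ring_next ,
.Fn mock_forward_one
and the
.Dv MOCK_BUF_CHANGED
value are hypothetical stand-ins invented for this illustration,
mirroring only the fields used in the fragment above.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real netmap definitions. */
#define MOCK_BUF_CHANGED 0x0002  /* assumed flag value, illustration only */

struct mock_slot {
    uint32_t buf_idx;   /* index of the buffer owned by this slot */
    uint16_t len;       /* packet length */
    uint16_t flags;
};

struct mock_ring {
    uint32_t num_slots;
    uint32_t head, cur;
    struct mock_slot slot[4];
};

/* Advance an index by one, wrapping at the end of the ring. */
static uint32_t
mock_ring_next(const struct mock_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Move one packet from rxr to txr by exchanging buffer indices. */
static void
mock_forward_one(struct mock_ring *rxr, struct mock_ring *txr)
{
    struct mock_slot *src = &rxr->slot[rxr->cur];
    struct mock_slot *dst = &txr->slot[txr->cur];
    uint32_t tmp = dst->buf_idx;

    dst->buf_idx = src->buf_idx;   /* tx slot now owns the received buffer */
    dst->len = src->len;
    dst->flags = MOCK_BUF_CHANGED;
    src->buf_idx = tmp;            /* rx slot gets the spare buffer back */
    src->flags = MOCK_BUF_CHANGED;
    rxr->head = rxr->cur = mock_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = mock_ring_next(txr, txr->cur);
}
```
.Ed
No packet data is copied: only the two buffer indices and the length
change hands, and both slots are flagged so that the new buffer
assignment is noticed on the next sync.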
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
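.Pp
The port names used above encode the ring selection in a suffix
character, following the naming schemes listed for
.Dv NIOCREGIF .
The following is a rough sketch of how such a suffix could be classified.
It is not the parser used by
.Fn nm_open :
.Vt enum port_mode ,
.Vt struct port_spec
and
.Fn parse_port
are hypothetical names, the enum values are arbitrary, and the sketch
assumes that the port name itself contains none of the suffix characters.
.Bd -literal
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tags mirroring the NR_REG_* names; values are arbitrary. */
enum port_mode {
    MODE_ALL_NIC, MODE_SW, MODE_NIC_SW,
    MODE_ONE_NIC, MODE_PIPE_MASTER, MODE_PIPE_SLAVE
};

struct port_spec {
    enum port_mode mode;
    long ringid;       /* ring or pipe identifier; 0 when unused */
    size_t namelen;    /* length of the port name without the suffix */
};

/*
 * Classify a port name by its suffix. Illustration only: the prefix and
 * the digits after the suffix character are not validated.
 */
static struct port_spec
parse_port(const char *name)
{
    struct port_spec s = { MODE_ALL_NIC, 0, 0 };
    size_t pos;

    assert(name != NULL);
    pos = strcspn(name, "^+-{}");   /* first suffix character, if any */
    s.namelen = pos;
    switch (name[pos]) {
    case '^': s.mode = MODE_SW; break;
    case '+': s.mode = MODE_NIC_SW; break;
    case '-': s.mode = MODE_ONE_NIC; break;
    case '{': s.mode = MODE_PIPE_MASTER; break;
    case '}': s.mode = MODE_PIPE_SLAVE; break;
    default: return s;              /* no suffix: all hardware rings */
    }
    /* only the last three modes carry a numeric identifier */
    if (s.mode >= MODE_ONE_NIC)
        s.ringid = strtol(name + pos + 1, NULL, 10);
    return s;
}
```
.Ed
With this sketch, "vale2:x{3" is classified as the master side of pipe 3
on port "vale2:x", and "netmap:foo^" as the host rings of "netmap:foo".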
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.