.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory-mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help converting
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
        char          ni_name[IFNAMSIZ]; /* name of the interface. */
        const u_int   ni_version;        /* API version */
        const u_int   ni_rx_rings;       /* number of rx ring pairs */
        const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
        const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
The 'reserved' field is used in receive rings to tell the kernel the
number of slots before 'cur' that are still in use by the application.
The 'avail' field indicates how many slots starting from 'cur'
are available to the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
        const ssize_t  buf_ofs;     /* see details */
        const uint32_t num_slots;   /* number of slots in the ring */
        uint32_t       avail;       /* number of usable slots */
        uint32_t       cur;         /* 'current' read/write index */
        uint32_t       reserved;    /* not refilled before current */

        const uint16_t nr_buf_size;
        uint16_t       flags;
#define NR_TIMESTAMP   0x0002       /* set timestamp on *sync() */
#define NR_FORWARD     0x0004       /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP    0x0008       /* set rx timestamp in slots */
        struct timeval ts;
        struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal

              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail (before syscall)
                           ^
                          cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
        uint32_t buf_idx;       /* buffer index */
        uint16_t len;           /* packet length */
        uint16_t flags;         /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
        uint64_t ptr;           /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through
macros.
.El
.Pp
Some macros support the access to objects in the shared memory
region.
In particular,
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh IOCTLS
.Nm
supports some ioctl() commands to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1 /* attach the NIC */
#define NETMAP_BDG_DETACH       2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3 /* register lookup function */
#define NETMAP_BDG_LIST         4 /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
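.Pp
The way a ring selector and a ring number share the
.Pa nr_ringid
field can be illustrated with a small, self-contained sketch.
It assumes that the ring number is or-ed into the low bits selected by
NETMAP_RING_MASK; the helpers
.Pa hw_ringid ,
.Pa ringid_index
and
.Pa ringid_is_hw
are hypothetical names used only for illustration and are not part of the
.Nm
API, and the constants are copied from the listing above instead of being
taken from the netmap headers.
.Bd -literal
#include <stdint.h>

/* Values copied from the struct nmreq listing; real code would
 * get them from <net/netmap.h>. */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */

/* Hypothetical helper: nr_ringid value selecting hardware ring i. */
static inline uint16_t
hw_ringid(unsigned int i)
{
        return (uint16_t)(NETMAP_HW_RING | (i & NETMAP_RING_MASK));
}

/* Hypothetical helper: ring number stored in an nr_ringid value. */
static inline unsigned int
ringid_index(uint16_t ringid)
{
        return ringid & NETMAP_RING_MASK;
}

/* Hypothetical helper: non-zero if a single hw ring is selected. */
static inline int
ringid_is_hw(uint16_t ringid)
{
        return (ringid & NETMAP_HW_RING) != 0;
}
.Ed
.Pp
Under these assumptions hw_ringid(3) yields 0x4003, and or-ing
NETMAP_NO_TX_POLL into that value leaves the ring number reported by
ringid_index() unchanged.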
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
to nr_ringid.
But normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code fragment implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <net/netmap.h>
#include <net/netmap_user.h>
struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
char *p, *buf;
int fd;
u_int i;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);     /* bind fd to ix0 */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
    MAP_SHARED, fd, 0);         /* map the shared region */
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
        poll(&fds, 1, -1);      /* wait for free tx slots */
        for ( ; ring->avail > 0 ; ring->avail--) {
                i = ring->cur;
                buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
                ... prepare packet in buf ...
                ring->slot[i].len = ... packet length ...
                ring->cur = NETMAP_RING_NEXT(ring, i);
        }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).