.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr VALE 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface. */
    const u_int   ni_version;        /* API version */
    const u_int   ni_rx_rings;       /* number of rx ring pairs */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
The 'reserved' field is used in receive rings to tell the kernel the
number of slots before 'cur' that are still in use by the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;     /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* number of usable slots */
    uint32_t       cur;         /* 'current' read/write index */
    uint32_t       reserved;    /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal

              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail   (before syscall)
                           ^
                          cur

.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* buf changed, etc. */
#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */
#define NS_REPORT      0x0002 /* tell hw to report results
                               * e.g. by generating an interrupt
                               */
#define NS_FORWARD     0x0004 /* pass packet to the other endpoint
                               * (host stack or device)
                               */
#define NS_NO_LEARN    0x0008
#define NS_INDIRECT    0x0010
#define NS_MOREFRAG    0x0020
#define NS_PORT_SHIFT  8
#define NS_PORT_MASK   (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot) ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;       /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through macros.
.El
.Pp
Some macros support access to objects in the shared memory
region.
In particular,
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is opened in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports.
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh IOCTLS
.Nm
supports some ioctl() calls to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
    char     nr_name[IFNAMSIZ];
    uint32_t nr_version;     /* API version */
#define NETMAP_API 4         /* current version */
    uint32_t nr_offset;      /* nifp offset in the shared region */
    uint32_t nr_memsize;     /* size of the shared region */
    uint32_t nr_tx_slots;    /* slots in tx rings */
    uint32_t nr_rx_slots;    /* slots in rx rings */
    uint16_t nr_tx_rings;    /* number of tx rings */
    uint16_t nr_rx_rings;    /* number of rx rings */
    uint16_t nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */
    uint16_t nr_cmd;
#define NETMAP_BDG_ATTACH     1  /* attach the NIC */
#define NETMAP_BDG_DETACH     2  /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3  /* register lookup function */
#define NETMAP_BDG_LIST       4  /* get bridge's info */
    uint16_t nr_arg1;
    uint16_t nr_arg2;
    uint32_t spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
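.Pp
The nr_ringid constants in the listing above can be combined as in the
following sketch.
The helpers ringid_hw(), ringid_index() and ringid_is_hw() are hypothetical
names, not part of the
.Nm
API.
The constant values are copied from the listing so that the fragment builds
without the netmap headers; they should be dropped when
.In net/netmap.h
is included.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the struct nmreq listing; drop with <net/netmap.h>. */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */

/* Hypothetical helper: nr_ringid value selecting hardware ring i. */
static uint16_t
ringid_hw(unsigned int i)
{
	assert(i <= NETMAP_RING_MASK);	/* ring number must fit the mask */
	return ((uint16_t)(NETMAP_HW_RING | i));
}

/* Hypothetical helper: extract the ring number, ignoring the flag bits. */
static unsigned int
ringid_index(uint16_t ringid)
{
	return (ringid & NETMAP_RING_MASK);
}

/* Hypothetical helper: true if nr_ringid names a single hardware ring. */
static int
ringid_is_hw(uint16_t ringid)
{
	return ((ringid & NETMAP_HW_RING) != 0);
}
```

For instance, ringid_hw(3) | NETMAP_NO_TX_POLL would select hardware ring 3
while asking for no gratuitous txsync on poll.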
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pa nr_tx_slots, nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings, nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number of rings and their sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
to nr_ringid.
Normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/mman.h>
#include <poll.h>
#include <net/netmap.h>
#include <net/netmap_user.h>
struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
char *p, *buf;
int fd;
u_int i;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);
/* map the shared region; its size was returned by NIOCREGIF */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
        ...
        prepare packet in buf ...
        ring->slot[i].len = ... packet length ...
        ring->cur = NETMAP_RING_NEXT(ring, i);
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).