.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr VALE 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port) and modular
software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is given by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface. */
    const u_int   ni_version;        /* API version */
    const u_int   ni_rx_rings;       /* number of rx ring pairs */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
The 'reserved' field is used in receive rings to tell the kernel
how many slots before 'cur' are still in use by the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;     /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* number of usable slots */
    uint32_t       cur;         /* 'current' read/write index */
    uint32_t       reserved;    /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call, 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal
              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call, 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel, updating
the 'cur' and 'avail' fields accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                      |res|--|<avail   (before syscall)
                           ^
                          cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* slot flags, see FLAGS below */
#define NS_RFRAGS(_slot)  ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;       /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated with the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through macros.
.El
.Pp
Some macros support access to objects in the shared memory region.
In particular,
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
.Pp
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement some form of
zero-copy forwarding (e.g. by passing buffers from an input
interface to an output interface), or needs to process packets
out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures a prompt notification
when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh IOCTLS
.Nm
supports some ioctl() commands to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1 /* attach the NIC */
#define NETMAP_BDG_DETACH       2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3 /* register lookup function */
#define NETMAP_BDG_LIST         4 /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
.Pp
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pp
.Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.Pp
.Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
to nr_ringid.
Normally you should keep this feature enabled unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
void *p;
char *buf;
uint32_t i;
int fd;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);     /* bind fd to ix0 */
/* map the shared region holding rings and buffers */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
    MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);          /* wait for free tx slots */
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
        ... prepare packet in buf ...
        ring->slot[i].len = ... packet length ...
        ring->cur = NETMAP_RING_NEXT(ring, i);
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).