1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 16.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 23.\" SUCH DAMAGE. 24.\" 25.\" This document is derived in part from the enet man page (enet.4) 26.\" distributed with 4.3BSD Unix. 27.\" 28.\" $FreeBSD$ 29.\" 30.Dd December 14, 2015 31.Dt NETMAP 4 32.Os 33.Sh NAME 34.Nm netmap 35.Nd a framework for fast packet I/O 36.br 37.Nm VALE 38.Nd a fast VirtuAl Local Ethernet using the netmap API 39.br 40.Nm netmap pipes 41.Nd a shared memory packet transport channel 42.Sh SYNOPSIS 43.Cd device netmap 44.Sh DESCRIPTION 45.Nm 46is a framework for extremely fast and efficient packet I/O 47for userspace and kernel clients, and for Virtual Machines. 48It runs on 49.Fx 50Linux and some versions of Windows, and supports a variety of 51.Nm netmap ports , 52including 53.Bl -tag -width XXXX 54.It Nm physical NIC ports 55to access individual queues of network interfaces; 56.It Nm host ports 57to inject packets into the host stack; 58.It Nm VALE ports 59implementing a very fast and modular in-kernel software switch/dataplane; 60.It Nm netmap pipes 61a shared memory packet transport channel; 62.It Nm netmap monitors 63a mechanism similar to 64.Xr bpf 65to capture traffic 66.El 67.Pp 68All these 69.Nm netmap ports 70are accessed interchangeably with the same API, 71and are at least one order of magnitude faster than 72standard OS mechanisms 73(sockets, bpf, tun/tap interfaces, native switches, pipes). 74With suitably fast hardware (NICs, PCIe buses, CPUs), 75packet I/O using 76.Nm 77on supported NICs 78reaches 14.88 million packets per second (Mpps) 79with much less than one core on 10 Gbit/s NICs; 8035-40 Mpps on 40 Gbit/s NICs (limited by the hardware); 81about 20 Mpps per core for VALE ports; 82and over 100 Mpps for 83.Nm netmap pipes. 84NICs without native 85.Nm 86support can still use the API in emulated mode, 87which uses unmodified device drivers and is 3-5 times faster than 88.Xr bpf 89or raw sockets. 90.Pp 91Userspace clients can dynamically switch NICs into 92.Nm 93mode and send and receive raw packets through 94memory mapped buffers. 
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports non-blocking I/O through
.Xr ioctl 2 ,
as well as synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.br
Both SSS and PPP have the form [0-9a-zA-Z_]+ ; the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are also supported.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
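.Pp
Putting the above together, a minimal (sketched) bind sequence,
with error handling omitted and
.Dq em0
as a placeholder port name, could look as follows:
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <net/netmap.h>

    struct nmreq req;
    void *mem;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;    /* always set the API version */
    strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);     /* em0 enters netmap mode */
    mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
.Ed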
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;      /* properties              */
    ...
    const uint32_t ni_tx_rings;   /* NIC tx rings            */
    const uint32_t ni_rx_rings;   /* NIC rx rings            */
    uint32_t       ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list; see the sketch at the end of this
section for a way to walk it.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
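.Pp
As an illustrative sketch (not part of the API proper), the extra
buffer list headed by
.Pa ni_bufs_head
could be walked as follows, assuming extra buffers were requested at
.Dv NIOCREGIF
time and
.Va ring
points to any ring of the interface:
.Bd -literal -compact
    uint32_t idx = nifp->ni_bufs_head;

    while (idx != 0) {
        char *p = NETMAP_BUF(ring, idx);
        /* save the link first: the leading uint32_t of each
         * buffer holds the index of the next one */
        idx = *(uint32_t *)p;
        /* ... p points to a free buffer, usable as storage ... */
    }
.Ed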
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
which are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
        head=cur     tail
            |          |
            v          v
  TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
             head=cur tail
                 |     |
                 v     v
  TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
             head=cur      tail
                 |          |
                 v          v
  TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
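.Pp
For illustration, the number of slots currently available for
transmission can be computed from the pointers above; the sketch
below is similar in spirit to helpers found in
.In net/netmap_user.h :
.Bd -literal -compact
    /* slots in the range head .. tail-1, i.e. usable by userspace */
    static inline uint32_t
    tx_slots_avail(struct netmap_ring *ring)
    {
        int space = ring->tail - ring->head;

        if (space < 0)
            space += ring->num_slots;
        return space;    /* 0 means the ring is full */
    }
.Ed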
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
       head  cur     tail
         |     |       |
         v     v       v
  RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
            head cur tail
              |  |     |
              v  v     v
  RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
            head cur       tail
              |  |           |
              v  v           v
  RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
requests a report when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: the length field always refers to the individual
fragment; there is no field carrying the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
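.Pp
As a sketch (with
.Va frags
and
.Va nfrags
being hypothetical application data), a packet spanning several
slots could be queued as follows:
.Bd -literal -compact
    /* queue one packet made of nfrags fragments on a TX ring;
     * the caller must ensure nfrags <= 64 slots are available */
    uint32_t i = ring->head;
    int j;

    for (j = 0; j < nfrags; j++) {
        struct netmap_slot *slot = &ring->slot[i];

        memcpy(NETMAP_BUF(ring, slot->buf_idx),
            frags[j].data, frags[j].len);
        slot->len = frags[j].len;
        /* all fragments but the last carry NS_MOREFRAG */
        slot->flags = (j == nfrags - 1) ? 0 : NS_MOREFRAG;
        i = nm_ring_next(ring, i);
    }
    ring->head = ring->cur = i;    /* expose the whole chain */
.Ed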
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring numbers and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use the function
.Va nm_open(..)
(see
.Sx LIBRARIES ) ,
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
see the sketch at the end of this section;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or with a full ring.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the
.Pa nr_tx_rings
and
.Pa nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
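.Pp
As a sketch of the low-level interface, the following fragment binds
a file descriptor to hardware ring pair 2 of a hypothetical interface
.Dq em0 ,
also disabling transmissions on
.Xr poll 2 :
.Bd -literal -compact
    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
    req.nr_flags = NR_REG_ONE_NIC;          /* one hw ring pair */
    req.nr_ringid = 2 | NETMAP_NO_TX_POLL;  /* ring 2, no tx on poll */
    ioctl(fd, NIOCREGIF, &req);
    /* roughly equivalent, using the helper library:
     *    nm_open("netmap:em0-2", NULL, NETMAP_NO_TX_POLL, NULL);
     */
.Ed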
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
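.Pp
A sketch of a typical event loop follows (with
.Va txring
being a hypothetical pointer to a transmit ring; the actual ring
processing is elided):
.Bd -literal -compact
    struct pollfd fds;

    fds.fd = fd;    /* the netmap file descriptor */
    for (;;) {
        fds.events = POLLIN;
        if (nm_ring_empty(txring)) {
            /* no tx space: also wait for POLLOUT instead
             * of busy waiting */
            fds.events |= POLLOUT;
        }
        poll(&fds, 1, -1);
        if (fds.revents & POLLIN) {
            /* drain receive rings, advance head/cur */
        }
        if (fds.revents & POLLOUT) {
            /* refill transmit rings, advance head/cur */
        }
    }
.Ed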
.Sh LIBRARIES
The
.Nm
API is designed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open_live 3 ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(use the corresponding fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.El
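.Pp
As a sketch, a pcap-style receive loop built on these helpers (with
.Dq em0
as a placeholder port and
.Fn process_pkt
a hypothetical application function) could look as follows:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

static void
cb(u_char *arg, const struct nm_pkthdr *h, const u_char *data)
{
    process_pkt(data, h->len);    /* one packet: h->len bytes */
}

void rx_loop(void)
{
    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
    struct pollfd fds = { .fd = NETMAP_FD(d), .events = POLLIN };

    for (;;) {
        poll(&fds, 1, -1);
        /* by analogy with pcap_dispatch(), a negative count
         * processes all pending packets */
        nm_dispatch(d, -1, cb, NULL);
    }
}
.Ed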
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr i40e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.br
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the
implementation of emulation.
Note that in case the netmap application subsequently moves packets
received from the emulated adapter onto the host RX ring, the sniffer
will intercept those packets again, since the packets are injected to
the host stack as if they were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.br
0 uses the best available option;
.br
1 forces native mode and fails if not available;
.br
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and numbers of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num ,
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or the
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ...);
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.