.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd March 2, 2017
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes who own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
This section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported as well.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;     /* properties              */
    ...
    const uint32_t ni_tx_rings;  /* NIC tx rings            */
    const uint32_t ni_rx_rings;  /* NIC rx rings            */
    uint32_t       ni_bufs_head; /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these extra buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index                 */
    uint16_t len;     /* packet length                */
    uint16_t flags;   /* buf changed, etc.            */
    uint64_t ptr;     /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved to the kernel.
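.Pp
As a rough illustration of how the three indexes split a ring between
userspace and the kernel, the standalone C model below computes the
user-owned range from
.Va head ,
.Va tail
and
.Va num_slots .
It does not use the
.Nm
headers: struct ring_model and both helper names are hypothetical and
exist only for this sketch.
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct ring_model {
    uint32_t num_slots; /* ring size, not necessarily a power of two */
    uint32_t head;      /* first slot available to userspace */
    uint32_t cur;       /* wakeup point */
    uint32_t tail;      /* first slot reserved to the kernel */
};

/* Number of slots in the range head ... tail-1 (owned by userspace). */
static uint32_t
user_slots(const struct ring_model *r)
{
    assert(r->num_slots > 0);
    assert(r->head < r->num_slots && r->tail < r->num_slots);
    /* Add num_slots before subtracting so the range may wrap around. */
    return (r->tail + r->num_slots - r->head) % r->num_slots;
}

/* Nonzero if slot i lies in head ... tail-1, zero if the kernel owns it. */
static int
slot_is_user_owned(const struct ring_model *r, uint32_t i)
{
    /* Distance of slot i from head, walking forward around the ring. */
    uint32_t dist = (i + r->num_slots - r->head) % r->num_slots;

    return dist < user_slots(r);
}
```
With num_slots = 10, head = 8 and tail = 3 the user-owned range wraps
around the end of the ring and covers slots 8, 9, 0, 1 and 2.
Because one slot is always kept empty, this count never exceeds
num_slots - 1.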
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
that are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
            head=cur    tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
               head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
               head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
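.Pp
The batching advice above can be sketched with a small standalone model
of a transmit ring.
This is not the
.Nm
API: struct tx_ring_model, MODEL_MAX_SLOTS and the three model_*
functions are hypothetical names, and model_tx_sync() merely assumes
that every queued packet has already left the port, which a real
NIOCTXSYNC does not guarantee.
```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_SLOTS 8 /* arbitrary bound, for this sketch only */

/* Hypothetical model of a transmit ring; only the fields used here. */
struct tx_ring_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    uint16_t len[MODEL_MAX_SLOTS]; /* stand-in for slot[i].len */
};

/* Next index modulo the ring size, in the spirit of nm_ring_next(). */
static uint32_t
model_ring_next(const struct tx_ring_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Queue up to n packets of length pktlen; return how many were queued. */
static uint32_t
model_tx_batch(struct tx_ring_model *r, uint32_t n, uint16_t pktlen)
{
    uint32_t queued = 0;

    assert(r->num_slots > 0 && r->num_slots <= MODEL_MAX_SLOTS);
    /* Slots head ... tail-1 are free; stop when the ring is full. */
    while (queued < n && r->head != r->tail) {
        r->len[r->head] = pktlen;
        r->head = r->cur = model_ring_next(r, r->head);
        queued++;
    }
    return queued;
}

/* Pretend that a sync sent everything queued so far. */
static void
model_tx_sync(struct tx_ring_model *r)
{
    /* All slots but one become available: tail lands just behind head. */
    r->tail = (r->head + r->num_slots - 1) % r->num_slots;
}
```
Starting from head = cur = 2 and tail = 5 on an 8-slot ring, a request
for 10 packets queues only 3 before the ring fills up.
After the simulated sync, 7 slots are free again, which is the most an
8-slot ring can offer since one slot is always kept empty.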
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
          head   cur    tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
              head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
              head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; the total length of a packet is not stored anywhere.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char     nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t nr_version;        /* (i) API version                */
    uint32_t nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t nr_memsize;        /* (o) size of the mmap region    */
    uint32_t nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t nr_cmd;            /* (i) special command            */
    uint16_t nr_arg1;           /* (i/o) extra arguments          */
    uint16_t nr_arg2;           /* (i/o) extra arguments          */
    uint32_t nr_arg3;           /* (i/o) extra arguments          */
    uint32_t nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use function
.Va nm_open(..)
(see
.Sx LIBRARIES )
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open 3pcap ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr i40e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that in case the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected to the host stack as if
they had been received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge 4
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *p, *buf;      /* mapped region, current packet buffer */
    uint32_t i;         /* current slot index */
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
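.Pp
Returning to the buffer swap shown under ZERO-COPY FORWARDING, the
standalone C model below extends it to a batch: it keeps swapping while
the receive ring has packets and the transmit ring has free slots.
It is only a sketch and does not use the
.Nm
headers: the struct names, model_forward() and the
MODEL_NS_BUF_CHANGED value are hypothetical placeholders.
```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_SLOTS      4      /* arbitrary bound for this sketch */
#define MODEL_NS_BUF_CHANGED 0x0001 /* placeholder value, model only */

struct slot_model {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

struct fwd_ring_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    struct slot_model slot[MODEL_MAX_SLOTS];
};

static uint32_t
fwd_ring_next(const struct fwd_ring_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Move up to limit packets from rxr to txr by swapping buffer indexes. */
static uint32_t
model_forward(struct fwd_ring_model *rxr, struct fwd_ring_model *txr,
    uint32_t limit)
{
    uint32_t n = 0;

    assert(rxr->num_slots <= MODEL_MAX_SLOTS);
    assert(txr->num_slots <= MODEL_MAX_SLOTS);
    /* head == tail means no packets (rx) or no free slots (tx). */
    while (n < limit && rxr->head != rxr->tail && txr->head != txr->tail) {
        struct slot_model *src = &rxr->slot[rxr->head];
        struct slot_model *dst = &txr->slot[txr->head];
        uint32_t tmp = dst->buf_idx;

        dst->buf_idx = src->buf_idx; /* packet buffer goes to tx */
        dst->len = src->len;
        dst->flags = MODEL_NS_BUF_CHANGED;
        src->buf_idx = tmp;          /* spare buffer replenishes rx */
        src->flags = MODEL_NS_BUF_CHANGED;
        rxr->head = rxr->cur = fwd_ring_next(rxr, rxr->head);
        txr->head = txr->cur = fwd_ring_next(txr, txr->head);
        n++;
    }
    return n;
}
```
The loop stops at whichever ring runs out first, so with three received
packets and two free transmit slots only two packets are forwarded and
the third stays in the receive ring for a later call.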
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.