1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 16.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 23.\" SUCH DAMAGE. 24.\" 25.\" This document is derived in part from the enet man page (enet.4) 26.\" distributed with 4.3BSD Unix. 27.\" 28.\" $FreeBSD$ 29.\" 30.Dd February 13, 2014 31.Dt NETMAP 4 32.Os 33.Sh NAME 34.Nm netmap 35.Nd a framework for fast packet I/O 36.br 37.Nm VALE 38.Nd a fast VirtuAl Local Ethernet using the netmap API 39.br 40.Nm netmap pipes 41.Nd a shared memory packet transport channel 42.Sh SYNOPSIS 43.Cd device netmap 44.Sh DESCRIPTION 45.Nm 46is a framework for extremely fast and efficient packet I/O 47for both userspace and kernel clients. 48It runs on FreeBSD and Linux, 49and includes 50.Nm VALE , 51a very fast and modular in-kernel software switch/dataplane, 52and 53.Nm netmap pipes , 54a shared memory packet transport channel. 55All these are accessed interchangeably with the same API. 56.Pp 57.Nm , VALE 58and 59.Nm netmap pipes 60are at least one order of magnitude faster than 61standard OS mechanisms 62(sockets, bpf, tun/tap interfaces, native switches, pipes), 63reaching 14.88 million packets per second (Mpps) 64with much less than one core on a 10 Gbit NIC, 65about 20 Mpps per core for VALE ports, 66and over 100 Mpps for netmap pipes. 67.Pp 68Userspace clients can dynamically switch NICs into 69.Nm 70mode and send and receive raw packets through 71memory mapped buffers. 72Similarly, 73.Nm VALE 74switch instances and ports, and 75.Nm netmap pipes 76can be created dynamically, 77providing high speed packet I/O between processes, 78virtual machines, NICs and the host stack. 79.Pp 80.Nm 81suports both non-blocking I/O through 82.Xr ioctls() , 83synchronization and blocking I/O through a file descriptor 84and standard OS mechanisms such as 85.Xr select 2 , 86.Xr poll 2 , 87.Xr epoll 2 , 88.Xr kqueue 2 . 89.Nm VALE 90and 91.Nm netmap pipes 92are implemented by a single kernel module, which also emulates the 93.Nm 94API over standard drivers for devices without native 95.Nm 96support. 97For best performance, 98.Nm 99requires explicit support in device drivers. 100.Pp 101In the rest of this (long) manual page we document 102various aspects of the 103.Nm 104and 105.Nm VALE 106architecture, features and usage. 107.Pp 108.Sh ARCHITECTURE 109.Nm 110supports raw packet I/O through a 111.Em port , 112which can be connected to a physical interface 113.Em ( NIC ) , 114to the host stack, 115or to a 116.Nm VALE 117switch). 118Ports use preallocated circular queues of buffers 119.Em ( rings ) 120residing in an mmapped region. 121There is one ring for each transmit/receive queue of a 122NIC or virtual port. 123An additional ring pair connects to the host stack. 124.Pp 125After binding a file descriptor to a port, a 126.Nm 127client can send or receive packets in batches through 128the rings, and possibly implement zero-copy forwarding 129between ports. 130.Pp 131All NICs operating in 132.Nm 133mode use the same memory region, 134accessible to all processes who own 135.Nm /dev/netmap 136file descriptors bound to NICs. 137Independent 138.Nm VALE 139and 140.Nm netmap pipe 141ports 142by default use separate memory regions, 143but can be independently configured to share memory. 144.Pp 145.Sh ENTERING AND EXITING NETMAP MODE 146The following section describes the system calls to create 147and control 148.Nm netmap 149ports (including 150.Nm VALE 151and 152.Nm netmap pipe 153ports). 154Simpler, higher level functions are described in section 155.Xr LIBRARIES . 156.Pp 157Ports and rings are created and controlled through a file descriptor, 158created by opening a special device 159.Dl fd = open("/dev/netmap"); 160and then bound to a specific port with an 161.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg); 162.Pp 163.Nm 164has multiple modes of operation controlled by the 165.Vt struct nmreq 166argument. 167.Va arg.nr_name 168specifies the port name, as follows: 169.Bl -tag -width XXXX 170.It Dv OS network interface name (e.g. 'em0', 'eth1', ... ) 171the data path of the NIC is disconnected from the host stack, 172and the file descriptor is bound to the NIC (one or all queues), 173or to the host stack; 174.It Dv valeXXX:YYY (arbitrary XXX and YYY) 175the file descriptor is bound to port YYY of a VALE switch called XXX, 176both dynamically created if necessary. 177The string cannot exceed IFNAMSIZ characters, and YYY cannot 178be the name of any existing OS network interface. 179.El 180.Pp 181On return, 182.Va arg 183indicates the size of the shared memory region, 184and the number, size and location of all the 185.Nm 186data structures, which can be accessed by mmapping the memory 187.Dl char *mem = mmap(0, arg.nr_memsize, fd); 188.Pp 189Non blocking I/O is done with special 190.Xr ioctl 2 191.Xr select 2 192and 193.Xr poll 2 194on the file descriptor permit blocking I/O. 195.Xr epoll 2 196and 197.Xr kqueue 2 198are not supported on 199.Nm 200file descriptors. 201.Pp 202While a NIC is in 203.Nm 204mode, the OS will still believe the interface is up and running. 205OS-generated packets for that NIC end up into a 206.Nm 207ring, and another ring is used to send packets into the OS network stack. 208A 209.Xr close 2 210on the file descriptor removes the binding, 211and returns the NIC to normal mode (reconnecting the data path 212to the host stack), or destroys the virtual port. 213.Pp 214.Sh DATA STRUCTURES 215The data structures in the mmapped memory region are detailed in 216.Xr sys/net/netmap.h , 217which is the ultimate reference for the 218.Nm 219API. The main structures and fields are indicated below: 220.Bl -tag -width XXX 221.It Dv struct netmap_if (one per interface) 222.Bd -literal 223struct netmap_if { 224 ... 225 const uint32_t ni_flags; /* properties */ 226 ... 227 const uint32_t ni_tx_rings; /* NIC tx rings */ 228 const uint32_t ni_rx_rings; /* NIC rx rings */ 229 uint32_t ni_bufs_head; /* head of extra bufs list */ 230 ... 231}; 232.Ed 233.Pp 234Indicates the number of available rings 235.Pa ( struct netmap_rings ) 236and their position in the mmapped region. 237The number of tx and rx rings 238.Pa ( ni_tx_rings , ni_rx_rings ) 239normally depends on the hardware. 240NICs also have an extra tx/rx ring pair connected to the host stack. 241.Em NIOCREGIF 242can also request additional unbound buffers in the same memory space, 243to be used as temporary storage for packets. 244.Pa ni_bufs_head 245contains the index of the first of these free rings, 246which are connected in a list (the first uint32_t of each 247buffer being the index of the next buffer in the list). 248A 0 indicates the end of the list. 249.Pp 250.It Dv struct netmap_ring (one per ring) 251.Bd -literal 252struct netmap_ring { 253 ... 254 const uint32_t num_slots; /* slots in each ring */ 255 const uint32_t nr_buf_size; /* size of each buffer */ 256 ... 257 uint32_t head; /* (u) first buf owned by user */ 258 uint32_t cur; /* (u) wakeup position */ 259 const uint32_t tail; /* (k) first buf owned by kernel */ 260 ... 261 uint32_t flags; 262 struct timeval ts; /* (k) time of last rxsync() */ 263 ... 264 struct netmap_slot slot[0]; /* array of slots */ 265} 266.Ed 267.Pp 268Implements transmit and receive rings, with read/write 269pointers, metadata and and an array of 270.Pa slots 271describing the buffers. 272.Pp 273.It Dv struct netmap_slot (one per buffer) 274.Bd -literal 275struct netmap_slot { 276 uint32_t buf_idx; /* buffer index */ 277 uint16_t len; /* packet length */ 278 uint16_t flags; /* buf changed, etc. */ 279 uint64_t ptr; /* address for indirect buffers */ 280}; 281.Ed 282.Pp 283Describes a packet buffer, which normally is identified by 284an index and resides in the mmapped region. 285.It Dv packet buffers 286Fixed size (normally 2 KB) packet buffers allocated by the kernel. 287.El 288.Pp 289The offset of the 290.Pa struct netmap_if 291in the mmapped region is indicated by the 292.Pa nr_offset 293field in the structure returned by 294.Pa NIOCREGIF . 295From there, all other objects are reachable through 296relative references (offsets or indexes). 297Macros and functions in <net/netmap_user.h> 298help converting them into actual pointers: 299.Pp 300.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset); 301.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index); 302.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index); 303.Pp 304.Dl char *buf = NETMAP_BUF(ring, buffer_index); 305.Sh RINGS, BUFFERS AND DATA I/O 306.Va Rings 307are circular queues of packets with three indexes/pointers 308.Va ( head , cur , tail ) ; 309one slot is always kept empty. 310The ring size 311.Va ( num_slots ) 312should not be assumed to be a power of two. 313.br 314(NOTE: older versions of netmap used head/count format to indicate 315the content of a ring). 316.Pp 317.Va head 318is the first slot available to userspace; 319.br 320.Va cur 321is the wakeup point: 322select/poll will unblock when 323.Va tail 324passes 325.Va cur ; 326.br 327.Va tail 328is the first slot reserved to the kernel. 329.Pp 330Slot indexes MUST only move forward; 331for convenience, the function 332.Dl nm_ring_next(ring, index) 333returns the next index modulo the ring size. 334.Pp 335.Va head 336and 337.Va cur 338are only modified by the user program; 339.Va tail 340is only modified by the kernel. 341The kernel only reads/writes the 342.Vt struct netmap_ring 343slots and buffers 344during the execution of a netmap-related system call. 345The only exception are slots (and buffers) in the range 346.Va tail\ . . . head-1 , 347that are explicitly assigned to the kernel. 348.Pp 349.Ss TRANSMIT RINGS 350On transmit rings, after a 351.Nm 352system call, slots in the range 353.Va head\ . . . tail-1 354are available for transmission. 355User code should fill the slots sequentially 356and advance 357.Va head 358and 359.Va cur 360past slots ready to transmit. 361.Va cur 362may be moved further ahead if the user code needs 363more slots before further transmissions (see 364.Sx SCATTER GATHER I/O ) . 365.Pp 366At the next NIOCTXSYNC/select()/poll(), 367slots up to 368.Va head-1 369are pushed to the port, and 370.Va tail 371may advance if further slots have become available. 372Below is an example of the evolution of a TX ring: 373.Pp 374.Bd -literal 375 after the syscall, slots between cur and tail are (a)vailable 376 head=cur tail 377 | | 378 v v 379 TX [.....aaaaaaaaaaa.............] 380 381 user creates new packets to (T)ransmit 382 head=cur tail 383 | | 384 v v 385 TX [.....TTTTTaaaaaa.............] 386 387 NIOCTXSYNC/poll()/select() sends packets and reports new slots 388 head=cur tail 389 | | 390 v v 391 TX [..........aaaaaaaaaaa........] 392.Ed 393.Pp 394select() and poll() wlll block if there is no space in the ring, i.e. 395.Dl ring->cur == ring->tail 396and return when new slots have become available. 397.Pp 398High speed applications may want to amortize the cost of system calls 399by preparing as many packets as possible before issuing them. 400.Pp 401A transmit ring with pending transmissions has 402.Dl ring->head != ring->tail + 1 (modulo the ring size). 403The function 404.Va int nm_tx_pending(ring) 405implements this test. 406.Pp 407.Ss RECEIVE RINGS 408On receive rings, after a 409.Nm 410system call, the slots in the range 411.Va head\& . . . tail-1 412contain received packets. 413User code should process them and advance 414.Va head 415and 416.Va cur 417past slots it wants to return to the kernel. 418.Va cur 419may be moved further ahead if the user code wants to 420wait for more packets 421without returning all the previous slots to the kernel. 422.Pp 423At the next NIOCRXSYNC/select()/poll(), 424slots up to 425.Va head-1 426are returned to the kernel for further receives, and 427.Va tail 428may advance to report new incoming packets. 429.br 430Below is an example of the evolution of an RX ring: 431.Bd -literal 432 after the syscall, there are some (h)eld and some (R)eceived slots 433 head cur tail 434 | | | 435 v v v 436 RX [..hhhhhhRRRRRRRR..........] 437 438 user advances head and cur, releasing some slots and holding others 439 head cur tail 440 | | | 441 v v v 442 RX [..*****hhhRRRRRR...........] 443 444 NICRXSYNC/poll()/select() recovers slots and reports new packets 445 head cur tail 446 | | | 447 v v v 448 RX [.......hhhRRRRRRRRRRRR....] 449.Ed 450.Pp 451.Sh SLOTS AND PACKET BUFFERS 452Normally, packets should be stored in the netmap-allocated buffers 453assigned to slots when ports are bound to a file descriptor. 454One packet is fully contained in a single buffer. 455.Pp 456The following flags affect slot and buffer processing: 457.Bl -tag -width XXX 458.It NS_BUF_CHANGED 459it MUST be used when the buf_idx in the slot is changed. 460This can be used to implement 461zero-copy forwarding, see 462.Sx ZERO-COPY FORWARDING . 463.Pp 464.It NS_REPORT 465reports when this buffer has been transmitted. 466Normally, 467.Nm 468notifies transmit completions in batches, hence signals 469can be delayed indefinitely. This flag helps detecting 470when packets have been send and a file descriptor can be closed. 471.It NS_FORWARD 472When a ring is in 'transparent' mode (see 473.Sx TRANSPARENT MODE ) , 474packets marked with this flags are forwarded to the other endpoint 475at the next system call, thus restoring (in a selective way) 476the connection between a NIC and the host stack. 477.It NS_NO_LEARN 478tells the forwarding code that the SRC MAC address for this 479packet must not be used in the learning bridge code. 480.It NS_INDIRECT 481indicates that the packet's payload is in a user-supplied buffer, 482whose user virtual address is in the 'ptr' field of the slot. 483The size can reach 65535 bytes. 484.br 485This is only supported on the transmit ring of 486.Nm VALE 487ports, and it helps reducing data copies in the interconnection 488of virtual machines. 489.It NS_MOREFRAG 490indicates that the packet continues with subsequent buffers; 491the last buffer in a packet must have the flag clear. 492.El 493.Sh SCATTER GATHER I/O 494Packets can span multiple slots if the 495.Va NS_MOREFRAG 496flag is set in all but the last slot. 497The maximum length of a chain is 64 buffers. 498This is normally used with 499.Nm VALE 500ports when connecting virtual machines, as they generate large 501TSO segments that are not split unless they reach a physical device. 502.Pp 503NOTE: The length field always refers to the individual 504fragment; there is no place with the total length of a packet. 505.Pp 506On receive rings the macro 507.Va NS_RFRAGS(slot) 508indicates the remaining number of slots for this packet, 509including the current one. 510Slots with a value greater than 1 also have NS_MOREFRAG set. 511.Sh IOCTLS 512.Nm 513uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) 514for non-blocking I/O. They take no argument. 515Two more ioctls (NIOCGINFO, NIOCREGIF) are used 516to query and configure ports, with the following argument: 517.Bd -literal 518struct nmreq { 519 char nr_name[IFNAMSIZ]; /* (i) port name */ 520 uint32_t nr_version; /* (i) API version */ 521 uint32_t nr_offset; /* (o) nifp offset in mmap region */ 522 uint32_t nr_memsize; /* (o) size of the mmap region */ 523 uint32_t nr_tx_slots; /* (i/o) slots in tx rings */ 524 uint32_t nr_rx_slots; /* (i/o) slots in rx rings */ 525 uint16_t nr_tx_rings; /* (i/o) number of tx rings */ 526 uint16_t nr_rx_rings; /* (i/o) number of tx rings */ 527 uint16_t nr_ringid; /* (i/o) ring(s) we care about */ 528 uint16_t nr_cmd; /* (i) special command */ 529 uint16_t nr_arg1; /* (i/o) extra arguments */ 530 uint16_t nr_arg2; /* (i/o) extra arguments */ 531 uint32_t nr_arg3; /* (i/o) extra arguments */ 532 uint32_t nr_flags /* (i/o) open mode */ 533 ... 534}; 535.Ed 536.Pp 537A file descriptor obtained through 538.Pa /dev/netmap 539also supports the ioctl supported by network devices, see 540.Xr netintro 4 . 541.Pp 542.Bl -tag -width XXXX 543.It Dv NIOCGINFO 544returns EINVAL if the named port does not support netmap. 545Otherwise, it returns 0 and (advisory) information 546about the port. 547Note that all the information below can change before the 548interface is actually put in netmap mode. 549.Pp 550.Bl -tag -width XX 551.It Pa nr_memsize 552indicates the size of the 553.Nm 554memory region. NICs in 555.Nm 556mode all share the same memory region, 557whereas 558.Nm VALE 559ports have independent regions for each port. 560.It Pa nr_tx_slots , nr_rx_slots 561indicate the size of transmit and receive rings. 562.It Pa nr_tx_rings , nr_rx_rings 563indicate the number of transmit 564and receive rings. 565Both ring number and sizes may be configured at runtime 566using interface-specific functions (e.g. 567.Xr ethtool 568). 569.El 570.It Dv NIOCREGIF 571binds the port named in 572.Va nr_name 573to the file descriptor. For a physical device this also switches it into 574.Nm 575mode, disconnecting 576it from the host stack. 577Multiple file descriptors can be bound to the same port, 578with proper synchronization left to the user. 579.Pp 580.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a 581.Em netmap pipe , 582consisting of two netmap ports with a crossover connection. 583A netmap pipe share the same memory space of the parent port, 584and is meant to enable configuration where a master process acts 585as a dispatcher towards slave processes. 586.Pp 587To enable this function, the 588.Pa nr_arg1 589field of the structure can be used as a hint to the kernel to 590indicate how many pipes we expect to use, and reserve extra space 591in the memory region. 592.Pp 593On return, it gives the same info as NIOCGINFO, 594with 595.Pa nr_ringid 596and 597.Pa nr_flags 598indicating the identity of the rings controlled through the file 599descriptor. 600.Pp 601.Va nr_flags 602.Va nr_ringid 603selects which rings are controlled through this file descriptor. 604Possible values of 605.Pa nr_flags 606are indicated below, together with the naming schemes 607that application libraries (such as the 608.Nm nm_open 609indicated below) can use to indicate the specific set of rings. 610In the example below, "netmap:foo" is any valid netmap port name. 611.Pp 612.Bl -tag -width XXXXX 613.It NR_REG_ALL_NIC "netmap:foo" 614(default) all hardware ring pairs 615.It NR_REG_SW_NIC "netmap:foo^" 616the ``host rings'', connecting to the host stack. 617.It NR_RING_NIC_SW "netmap:foo+ 618all hardware rings and the host rings 619.It NR_REG_ONE_NIC "netmap:foo-i" 620only the i-th hardware ring pair, where the number is in 621.Pa nr_ringid ; 622.It NR_REG_PIPE_MASTER "netmap:foo{i" 623the master side of the netmap pipe whose identifier (i) is in 624.Pa nr_ringid ; 625.It NR_REG_PIPE_SLAVE "netmap:foo}i" 626the slave side of the netmap pipe whose identifier (i) is in 627.Pa nr_ringid . 628.Pp 629The identifier of a pipe must be thought as part of the pipe name, 630and does not need to be sequential. On return the pipe 631will only have a single ring pair with index 0, 632irrespective of the value of i. 633.El 634.Pp 635By default, a 636.Xr poll 2 637or 638.Xr select 2 639call pushes out any pending packets on the transmit ring, even if 640no write events are specified. 641The feature can be disabled by or-ing 642.Va NETMAP_NO_TX_SYNC 643to the value written to 644.Va nr_ringid. 645When this feature is used, 646packets are transmitted only on 647.Va ioctl(NIOCTXSYNC) 648or select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring. 649.Pp 650When registering a virtual interface that is dynamically created to a 651.Xr vale 4 652switch, we can specify the desired number of rings (1 by default, 653and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields. 654.It Dv NIOCTXSYNC 655tells the hardware of new packets to transmit, and updates the 656number of slots available for transmission. 657.It Dv NIOCRXSYNC 658tells the hardware of consumed packets, and asks for newly available 659packets. 660.El 661.Sh SELECT, POLL, EPOLL, KQUEUE. 662.Xr select 2 663and 664.Xr poll 2 665on a 666.Nm 667file descriptor process rings as indicated in 668.Sx TRANSMIT RINGS 669and 670.Sx RECEIVE RINGS , 671respectively when write (POLLOUT) and read (POLLIN) events are requested. 672Both block if no slots are available in the ring 673.Va ( ring->cur == ring->tail ) . 674Depending on the platform, 675.Xr epoll 2 676and 677.Xr kqueue 2 678are supported too. 679.Pp 680Packets in transmit rings are normally pushed out 681(and buffers reclaimed) even without 682requesting write events. Passing the NETMAP_NO_TX_SYNC flag to 683.Em NIOCREGIF 684disables this feature. 685By default, receive rings are processed only if read 686events are requested. Passing the NETMAP_DO_RX_SYNC flag to 687.Em NIOCREGIF updates receive rings even without read events. 688Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC 689only have an effect when some event is posted for the file descriptor. 690.Sh LIBRARIES 691The 692.Nm 693API is supposed to be used directly, both because of its simplicity and 694for efficient integration with applications. 695.Pp 696For conveniency, the 697.Va <net/netmap_user.h> 698header provides a few macros and functions to ease creating 699a file descriptor and doing I/O with a 700.Nm 701port. These are loosely modeled after the 702.Xr pcap 3 703API, to ease porting of libpcap-based applications to 704.Nm . 705To use these extra functions, programs should 706.Dl #define NETMAP_WITH_LIBS 707before 708.Dl #include <net/netmap_user.h> 709.Pp 710The following functions are available: 711.Bl -tag -width XXXXX 712.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg) 713similar to 714.Xr pcap_open , 715binds a file descriptor to a port. 716.Bl -tag -width XX 717.It Va ifname 718is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a 719.Nm VALE 720port. 721.It Va req 722provides the initial values for the argument to the NIOCREGIF ioctl. 723The nm_flags and nm_ringid values are overwritten by parsing 724ifname and flags, and other fields can be overridden through 725the other two arguments. 726.It Va arg 727points to a struct nm_desc containing arguments (e.g. from a previously 728open file descriptor) that should override the defaults. 729The fields are used as described below 730.It Va flags 731can be set to a combination of the following flags: 732.Va NETMAP_NO_TX_POLL , 733.Va NETMAP_DO_RX_POLL 734(copied into nr_ringid); 735.Va NM_OPEN_NO_MMAP (if arg points to the same memory region, 736avoids the mmap and uses the values from it); 737.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg); 738.Va NM_OPEN_ARG1 , 739.Va NM_OPEN_ARG2 , 740.Va NM_OPEN_ARG3 (uses the fields from arg); 741.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg). 742.El 743.It Va int nm_close(struct nm_desc *d) 744closes the file descriptor, unmaps memory, frees resources. 745.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size) 746similar to pcap_inject(), pushes a packet to a ring, returns the size 747of the packet is successful, or 0 on error; 748.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg) 749similar to pcap_dispatch(), applies a callback to incoming packets 750.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr) 751similar to pcap_next(), fetches the next packet 752.Pp 753.El 754.Sh SUPPORTED DEVICES 755.Nm 756natively supports the following devices: 757.Pp 758On FreeBSD: 759.Xr em 4 , 760.Xr igb 4 , 761.Xr ixgbe 4 , 762.Xr lem 4 , 763.Xr re 4 . 764.Pp 765On Linux 766.Xr e1000 4 , 767.Xr e1000e 4 , 768.Xr igb 4 , 769.Xr ixgbe 4 , 770.Xr mlx4 4 , 771.Xr forcedeth 4 , 772.Xr r8169 4 . 773.Pp 774NICs without native support can still be used in 775.Nm 776mode through emulation. Performance is inferior to native netmap 777mode but still significantly higher than sockets, and approaching 778that of in-kernel solutions such as Linux's 779.Xr pktgen . 780.Pp 781Emulation is also available for devices with native netmap support, 782which can be used for testing or performance comparison. 783The sysctl variable 784.Va dev.netmap.admode 785globally controls how netmap mode is implemented. 786.Sh SYSCTL VARIABLES AND MODULE PARAMETERS 787Some aspect of the operation of 788.Nm 789are controlled through sysctl variables on FreeBSD 790.Em ( dev.netmap.* ) 791and module parameters on Linux 792.Em ( /sys/module/netmap_lin/parameters/* ) : 793.Pp 794.Bl -tag -width indent 795.It Va dev.netmap.admode: 0 796Controls the use of native or emulated adapter mode. 7970 uses the best available option, 1 forces native and 798fails if not available, 2 forces emulated hence never fails. 799.It Va dev.netmap.generic_ringsize: 1024 800Ring size used for emulated netmap mode 801.It Va dev.netmap.generic_mit: 100000 802Controls interrupt moderation for emulated mode 803.It Va dev.netmap.mmap_unreg: 0 804.It Va dev.netmap.fwd: 0 805Forces NS_FORWARD mode 806.It Va dev.netmap.flags: 0 807.It Va dev.netmap.txsync_retry: 2 808.It Va dev.netmap.no_pendintr: 1 809Forces recovery of transmit buffers on system calls 810.It Va dev.netmap.mitigate: 1 811Propagates interrupt mitigation to user processes 812.It Va dev.netmap.no_timestamp: 0 813Disables the update of the timestamp in the netmap ring 814.It Va dev.netmap.verbose: 0 815Verbose kernel messages 816.It Va dev.netmap.buf_num: 163840 817.It Va dev.netmap.buf_size: 2048 818.It Va dev.netmap.ring_num: 200 819.It Va dev.netmap.ring_size: 36864 820.It Va dev.netmap.if_num: 100 821.It Va dev.netmap.if_size: 1024 822Sizes and number of objects (netmap_if, netmap_ring, buffers) 823for the global memory region. The only parameter worth modifying is 824.Va dev.netmap.buf_num 825as it impacts the total amount of memory used by netmap. 826.It Va dev.netmap.buf_curr_num: 0 827.It Va dev.netmap.buf_curr_size: 0 828.It Va dev.netmap.ring_curr_num: 0 829.It Va dev.netmap.ring_curr_size: 0 830.It Va dev.netmap.if_curr_num: 0 831.It Va dev.netmap.if_curr_size: 0 832Actual values in use. 833.It Va dev.netmap.bridge_batch: 1024 834Batch size used when moving packets across a 835.Nm VALE 836switch. Values above 64 generally guarantee good 837performance. 838.El 839.Sh SYSTEM CALLS 840.Nm 841uses 842.Xr select 2 , 843.Xr poll 2 , 844.Xr epoll 845and 846.Xr kqueue 847to wake up processes when significant events occur, and 848.Xr mmap 2 849to map memory. 850.Xr ioctl 2 851is used to configure ports and 852.Nm VALE switches . 853.Pp 854Applications may need to create threads and bind them to 855specific cores to improve performance, using standard 856OS primitives, see 857.Xr pthread 3 . 858In particular, 859.Xr pthread_setaffinity_np 3 860may be of use. 861.Sh CAVEATS 862No matter how fast the CPU and OS are, 863achieving line rate on 10G and faster interfaces 864requires hardware with sufficient performance. 865Several NICs are unable to sustain line rate with 866small packet sizes. Insufficient PCIe or memory bandwidth 867can also cause reduced performance. 868.Pp 869Another frequent reason for low performance is the use 870of flow control on the link: a slow receiver can limit 871the transmit speed. 872Be sure to disable flow control when running high 873speed experiments. 874.Pp 875.Ss SPECIAL NIC FEATURES 876.Nm 877is orthogonal to some NIC features such as 878multiqueue, schedulers, packet filters. 879.Pp 880Multiple transmit and receive rings are supported natively 881and can be configured with ordinary OS tools, 882such as 883.Xr ethtool 884or 885device-specific sysctl variables. 886The same goes for Receive Packet Steering (RPS) 887and filtering of incoming traffic. 888.Pp 889.Nm 890.Em does not use 891features such as 892.Em checksum offloading , TCP segmentation offloading , 893.Em encryption , VLAN encapsulation/decapsulation , 894etc. . 895When using netmap to exchange packets with the host stack, 896make sure to disable these features. 897.Sh EXAMPLES 898.Ss TEST PROGRAMS 899.Nm 900comes with a few programs that can be used for testing or 901simple applications. 902See the 903.Va examples/ 904directory in 905.Nm 906distributions, or 907.Va tools/tools/netmap/ 908directory in FreeBSD distributions. 909.Pp 910.Xr pkt-gen 911is a general purpose traffic source/sink. 912.Pp 913As an example 914.Dl pkt-gen -i ix0 -f tx -l 60 915can generate an infinite stream of minimum size packets, and 916.Dl pkt-gen -i ix0 -f rx 917is a traffic sink. 918Both print traffic statistics, to help monitor 919how the system performs. 920.Pp 921.Xr pkt-gen 922has many options can be uses to set packet sizes, addresses, 923rates, and use multiple send/receive threads and cores. 924.Pp 925.Xr bridge 926is another test program which interconnects two 927.Nm 928ports. It can be used for transparent forwarding between 929interfaces, as in 930.Dl bridge -i ix0 -i ix1 931or even connect the NIC to the host stack using netmap 932.Dl bridge -i ix0 -i ix0 933.Ss USING THE NATIVE API 934The following code implements a traffic generator 935.Pp 936.Bd -literal -compact 937#include <net/netmap_user.h> 938... 939void sender(void) 940{ 941 struct netmap_if *nifp; 942 struct netmap_ring *ring; 943 struct nmreq nmr; 944 struct pollfd fds; 945 946 fd = open("/dev/netmap", O_RDWR); 947 bzero(&nmr, sizeof(nmr)); 948 strcpy(nmr.nr_name, "ix0"); 949 nmr.nm_version = NETMAP_API; 950 ioctl(fd, NIOCREGIF, &nmr); 951 p = mmap(0, nmr.nr_memsize, fd); 952 nifp = NETMAP_IF(p, nmr.nr_offset); 953 ring = NETMAP_TXRING(nifp, 0); 954 fds.fd = fd; 955 fds.events = POLLOUT; 956 for (;;) { 957 poll(&fds, 1, -1); 958 while (!nm_ring_empty(ring)) { 959 i = ring->cur; 960 buf = NETMAP_BUF(ring, ring->slot[i].buf_index); 961 ... prepare packet in buf ... 962 ring->slot[i].len = ... packet length ... 963 ring->head = ring->cur = nm_ring_next(ring, i); 964 } 965 } 966} 967.Ed 968.Ss HELPER FUNCTIONS 969A simple receiver can be implemented using the helper functions 970.Bd -literal -compact 971#define NETMAP_WITH_LIBS 972#include <net/netmap_user.h> 973... 974void receiver(void) 975{ 976 struct nm_desc *d; 977 struct pollfd fds; 978 u_char *buf; 979 struct nm_pkthdr h; 980 ... 981 d = nm_open("netmap:ix0", NULL, 0, 0); 982 fds.fd = NETMAP_FD(d); 983 fds.events = POLLIN; 984 for (;;) { 985 poll(&fds, 1, -1); 986 while ( (buf = nm_nextpkt(d, &h)) ) 987 consume_pkt(buf, h->len); 988 } 989 nm_close(d); 990} 991.Ed 992.Ss ZERO-COPY FORWARDING 993Since physical interfaces share the same memory region, 994it is possible to do packet forwarding between ports 995swapping buffers. The buffer from the transmit ring is used 996to replenish the receive ring: 997.Bd -literal -compact 998 uint32_t tmp; 999 struct netmap_slot *src, *dst; 1000 ... 1001 src = &src_ring->slot[rxr->cur]; 1002 dst = &dst_ring->slot[txr->cur]; 1003 tmp = dst->buf_idx; 1004 dst->buf_idx = src->buf_idx; 1005 dst->len = src->len; 1006 dst->flags = NS_BUF_CHANGED; 1007 src->buf_idx = tmp; 1008 src->flags = NS_BUF_CHANGED; 1009 rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur); 1010 txr->head = txr->cur = nm_ring_next(txr, txr->cur); 1011 ... 1012.Ed 1013.Ss ACCESSING THE HOST STACK 1014The host stack is for all practical purposes just a regular ring pair, 1015which you can access with the netmap API (e.g. with 1016.Dl nm_open("netmap:eth0^", ... ) ; 1017All packets that the host would send to an interface in 1018.Nm 1019mode end up into the RX ring, whereas all packets queued to the 1020TX ring are send up to the host stack. 1021.Ss VALE SWITCH 1022A simple way to test the performance of a 1023.Nm VALE 1024switch is to attach a sender and a receiver to it, 1025e.g. running the following in two different terminals: 1026.Dl pkt-gen -i vale1:a -f rx # receiver 1027.Dl pkt-gen -i vale1:b -f tx # sender 1028The same example can be used to test netmap pipes, by simply 1029changing port names, e.g. 1030.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side 1031.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side 1032.Pp 1033The following command attaches an interface and the host stack 1034to a switch: 1035.Dl vale-ctl -h vale2:em0 1036Other 1037.Nm 1038clients attached to the same switch can now communicate 1039with the network card or the host. 1040.Pp 1041.Sh SEE ALSO 1042.Pp 1043http://info.iet.unipi.it/~luigi/netmap/ 1044.Pp 1045Luigi Rizzo, Revisiting network I/O APIs: the netmap framework, 1046Communications of the ACM, 55 (3), pp.45-51, March 2012 1047.Pp 1048Luigi Rizzo, netmap: a novel framework for fast packet I/O, 1049Usenix ATC'12, June 2012, Boston 1050.Pp 1051Luigi Rizzo, Giuseppe Lettieri, 1052VALE, a switched ethernet for virtual machines, 1053ACM CoNEXT'12, December 2012, Nice 1054.Pp 1055Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione, 1056Speeding up packet I/O in virtual machines, 1057ACM/IEEE ANCS'13, October 2013, San Jose 1058.Sh AUTHORS 1059.An -nosplit 1060The 1061.Nm 1062framework has been originally designed and implemented at the 1063Universita` di Pisa in 2011 by 1064.An Luigi Rizzo , 1065and further extended with help from 1066.An Matteo Landi , 1067.An Gaetano Catalli , 1068.An Giuseppe Lettieri , 1069.An Vincenzo Maffione . 1070.Pp 1071.Nm 1072and 1073.Nm VALE 1074have been funded by the European Commission within FP7 Projects 1075CHANGE (257422) and OPENLAB (287581). 1076