1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 16.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 23.\" SUCH DAMAGE. 24.\" 25.\" This document is derived in part from the enet man page (enet.4) 26.\" distributed with 4.3BSD Unix. 27.\" 28.\" $FreeBSD$ 29.\" 30.Dd January 4, 2014 31.Dt NETMAP 4 32.Os 33.Sh NAME 34.Nm netmap 35.Nd a framework for fast packet I/O 36.br 37.Nm VALE 38.Nd a fast VirtuAl Local Ethernet using the netmap API 39.Sh SYNOPSIS 40.Cd device netmap 41.Sh DESCRIPTION 42.Nm 43is a framework for extremely fast and efficient packet I/O 44for both userspace and kernel clients. 
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane.
.Pp
.Nm
and
.Nm VALE
are one order of magnitude faster than sockets, bpf or
native switches based on
.Xr tun/tap 4 ,
reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
and 20 Mpps per core for VALE ports.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
A selectable file descriptor supports
synchronization and blocking I/O.
.Pp
Similarly,
.Nm VALE
can dynamically create switch instances and ports,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
For best performance,
.Nm
requires explicit support in device drivers;
however, the
.Nm
API can be emulated on top of unmodified device drivers,
at the price of reduced performance
(but still better than sockets or BPF/pcap).
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Pp
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
.Nm VALE
ports instead use separate memory regions.
.Pp
.Sh ENTERING AND EXITING NETMAP MODE
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap");
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
commands;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up into a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Pp
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;          /* properties      */
    ...
    const uint32_t ni_tx_rings;       /* NIC tx rings    */
    const uint32_t ni_rx_rings;       /* NIC rx rings    */
    const uint32_t ni_extra_tx_rings; /* extra tx rings  */
    const uint32_t ni_extra_rx_rings; /* extra rx rings  */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can request additional tx/rx rings,
to be used between multiple processes/threads
accessing the same
.Nm
port.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.Pp
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;  /* buffer index                 */
    uint16_t len;      /* packet length                */
    uint16_t flags;    /* buf changed, etc.            */
    uint64_t ptr;      /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.Pa <net/netmap_user.h>
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exception are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
that are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Pp
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
          head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
               head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
               head=cur      tail
                    |           |
                    v           v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Pp
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head   cur     tail
            |      |       |
            v      v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
              head cur    tail
                 |  |      |
                 v  v      v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
              head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Pp
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.Pp
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; no field holds the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char     nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t nr_version;        /* (i) API version                */
    uint32_t nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t nr_memsize;        /* (o) size of the mmap region    */
    uint32_t nr_tx_slots;       /* (o) slots in tx rings          */
    uint32_t nr_rx_slots;       /* (o) slots in rx rings          */
    uint16_t nr_tx_rings;       /* (o) number of tx rings         */
    uint16_t nr_rx_rings;       /* (o) number of rx rings         */
    uint16_t nr_ringid;         /* (i) ring(s) we care about      */
    uint16_t nr_cmd;            /* (i) special command            */
    uint16_t nr_arg1;           /* (i) extra arguments            */
    uint16_t nr_arg2;           /* (i) extra arguments            */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Pp
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
On return, it gives the same info as NIOCGINFO, and
.Va nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_ringid
selects which rings are controlled through this file descriptor.
Possible values are:
.Bl -tag -width XXXXX
.It 0
(default) all hardware rings
.It NETMAP_SW_RING
the ``host rings'', connecting to the host stack.
.It NETMAP_HW_RING | i
the i-th hardware ring.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_SYNC
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or on a full ring.
.Pp
When registering a virtual interface that is dynamically created
and attached to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT AND POLL
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS
when write (POLLOUT) and read (POLLIN) events are requested.
.Pp
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
.Pp
Packets in transmit rings are normally pushed out even without
requesting write events.
Passing the NETMAP_NO_TX_SYNC flag to
.Em NIOCREGIF
disables this feature.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va flags
can be set to
.Va NETMAP_SW_RING
to bind to the host ring pair,
or to NETMAP_HW_RING to bind to a specific ring.
.Va ring_name ,
with NETMAP_HW_RING,
is interpreted as a string or an integer indicating the ring to use.
.It Va ring_flags
is copied directly into the ring flags, to specify additional parameters
such as NR_TIMESTAMP or NR_FORWARD.
.El
.It Va int nm_close(struct nm_desc_t *d)
closes the file descriptor, unmaps memory, and frees resources.
.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
similar to pcap_next(), fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Pp
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Pp
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or the
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
...
void sender(void)
{
    int fd, i;
    char *buf;
    void *p;
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize,
        PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
...
void receiver(void)
{
    struct nm_desc_t *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_hdr_t h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Pp
.Sh SEE ALSO
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Pp
.Ss SPECIAL MODES
When the device name has the form
.Dl valeXXX:ifname (ifname is an existing interface)
the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the
.Nm VALE
switch named XXX.
In this case the
.Pa ioctl()
is used only for configuration, typically through the
.Xr vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
closed after issuing the
.Pa ioctl() .