.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd December 14, 2015
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Pp
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on
.Fx
and Linux, and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm ,
.Nm VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
may also be supported on
.Nm
file descriptors (see
.Sx SELECT, POLL, EPOLL, KQUEUE ) .
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;     /* properties              */
    ...
    const uint32_t ni_tx_rings;  /* NIC tx rings            */
    const uint32_t ni_rx_rings;  /* NIC rx rings            */
    uint32_t       ni_bufs_head; /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index                 */
    uint16_t len;     /* packet length                */
    uint16_t flags;   /* buf changed, etc.            */
    uint64_t ptr;     /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used a head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward; for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
which are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
   after the syscall, slots between cur and tail are (a)vailable
         head=cur      tail
              |          |
              v          v
    TX  [.....aaaaaaaaaaa.............]

   user creates new packets to (T)ransmit
              head=cur  tail
                   |     |
                   v     v
    TX  [.....TTTTTaaaaaa.............]

   NIOCTXSYNC/poll()/select() sends packets and reports new slots
              head=cur      tail
                   |          |
                   v          v
    TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
   after the syscall, there are some (h)eld and some (R)eceived slots
         head   cur     tail
           |     |       |
           v     v       v
    RX  [..hhhhhhRRRRRRRR..........]

   user advances head and cur, releasing some slots and holding others
             head cur   tail
                |  |     |
                v  v     v
    RX  [..*****hhhRRRRRR..........]

   NIOCRXSYNC/poll()/select() recovers slots and reports new packets
             head cur         tail
                |  |           |
                v  v           v
    RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no field reporting the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
function indicated below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or with a full ring.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open_live 3 ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num ,
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or the
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator.
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
	struct netmap_if *nifp;
	struct netmap_ring *ring;
	struct nmreq nmr;
	struct pollfd fds;
	char *buf, *p;
	uint32_t i;
	int fd;

	fd = open("/dev/netmap", O_RDWR);
	bzero(&nmr, sizeof(nmr));
	strcpy(nmr.nr_name, "ix0");
	nmr.nr_version = NETMAP_API;
	ioctl(fd, NIOCREGIF, &nmr);
	p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	nifp = NETMAP_IF(p, nmr.nr_offset);
	ring = NETMAP_TXRING(nifp, 0);
	fds.fd = fd;
	fds.events = POLLOUT;
	for (;;) {
		poll(&fds, 1, -1);
		while (!nm_ring_empty(ring)) {
			i = ring->cur;
			buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
			... prepare packet in buf ...
			ring->slot[i].len = ... packet length ...
			ring->head = ring->cur = nm_ring_next(ring, i);
		}
	}
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
	struct nm_desc *d;
	struct pollfd fds;
	u_char *buf;
	struct nm_pkthdr h;
	...
	d = nm_open("netmap:ix0", NULL, 0, 0);
	fds.fd = NETMAP_FD(d);
	fds.events = POLLIN;
	for (;;) {
		poll(&fds, 1, -1);
		while ( (buf = nm_nextpkt(d, &h)) )
			consume_pkt(buf, h.len);
	}
	nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
	uint32_t tmp;
	struct netmap_slot *src, *dst;
	...
	src = &rxr->slot[rxr->cur];
	dst = &txr->slot[txr->cur];
	tmp = dst->buf_idx;
	dst->buf_idx = src->buf_idx;
	dst->len = src->len;
	dst->flags = NS_BUF_CHANGED;
	src->buf_idx = tmp;
	src->flags = NS_BUF_CHANGED;
	rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
	txr->head = txr->cur = nm_ring_next(txr, txr->cur);
	...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
.Pp
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.