1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND 14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 16.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE 17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 23.\" SUCH DAMAGE. 24.\" 25.\" This document is derived in part from the enet man page (enet.4) 26.\" distributed with 4.3BSD Unix. 27.\" 28.\" $FreeBSD$ 29.\" 30.Dd February 13, 2014 31.Dt NETMAP 4 32.Os 33.Sh NAME 34.Nm netmap 35.Nd a framework for fast packet I/O 36.br 37.Nm VALE 38.Nd a fast VirtuAl Local Ethernet using the netmap API 39.br 40.Nm netmap pipes 41.Nd a shared memory packet transport channel 42.Sh SYNOPSIS 43.Cd device netmap 44.Sh DESCRIPTION 45.Nm 46is a framework for extremely fast and efficient packet I/O 47for both userspace and kernel clients. 48It runs on FreeBSD and Linux, 49and includes 50.Nm VALE , 51a very fast and modular in-kernel software switch/dataplane, 52and 53.Nm netmap pipes , 54a shared memory packet transport channel. 55All these are accessed interchangeably with the same API. 56.Pp 57.Nm , VALE 58and 59.Nm netmap pipes 60are at least one order of magnitude faster than 61standard OS mechanisms 62(sockets, bpf, tun/tap interfaces, native switches, pipes), 63reaching 14.88 million packets per second (Mpps) 64with much less than one core on a 10 Gbit NIC, 65about 20 Mpps per core for VALE ports, 66and over 100 Mpps for netmap pipes. 67.Pp 68Userspace clients can dynamically switch NICs into 69.Nm 70mode and send and receive raw packets through 71memory mapped buffers. 72Similarly, 73.Nm VALE 74switch instances and ports, and 75.Nm netmap pipes 76can be created dynamically, 77providing high speed packet I/O between processes, 78virtual machines, NICs and the host stack. 79.Pp 80.Nm 81suports both non-blocking I/O through 82.Xr ioctls() , 83synchronization and blocking I/O through a file descriptor 84and standard OS mechanisms such as 85.Xr select 2 , 86.Xr poll 2 , 87.Xr epoll 2 , 88.Xr kqueue 2 . 89.Nm VALE 90and 91.Nm netmap pipes 92are implemented by a single kernel module, which also emulates the 93.Nm 94API over standard drivers for devices without native 95.Nm 96support. 97For best performance, 98.Nm 99requires explicit support in device drivers. 
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap");
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, fd);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up into a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
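.Pp
To summarize, a minimal sketch of entering netmap mode follows
(error handling omitted; "em0" is just an example port name):
.Bd -literal
    struct nmreq req;
    void *mem;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;   /* always set the API version */
    strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);    /* bind fd, enter netmap mode */
    mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);        /* map rings and buffers */
.Ed
A complete program along these lines is shown in the
.Sx EXAMPLES
section.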
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t ni_flags;      /* properties              */
    ...
    const uint32_t ni_tx_rings;   /* NIC tx rings            */
    const uint32_t ni_rx_rings;   /* NIC rx rings            */
    uint32_t       ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A 0 indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index                 */
    uint16_t len;     /* packet length                */
    uint16_t flags;   /* buf changed, etc.            */
    uint64_t ptr;     /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in <net/netmap_user.h>
help converting them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
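.Pp
The number of slots a user program may process is thus the distance
from
.Va head
to
.Va tail ,
modulo the ring size.
A minimal sketch of this computation (the
.Va nm_ring_space()
helper in <net/netmap_user.h> performs a similar calculation):
.Bd -literal
    /* slots available to userspace: head .. tail-1 */
    static inline uint32_t
    ring_space(const struct netmap_ring *r)
    {
        int n = r->tail - r->head;

        if (n < 0)          /* tail wrapped around */
            n += r->num_slots;
        return n;           /* 0 means no slots available */
    }
.Ed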
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exception is slots (and buffers) in the range
.Va tail . . . head-1 ,
which are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur    tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
             head=cur     tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
             head=cur          tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
          head    cur     tail
             |     |       |
             v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
              head  cur   tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
              head  cur         tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
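.Pp
A minimal sketch of a receive loop under this model (assuming
.Va rxr
was obtained with NETMAP_RXRING(), and consume() is a hypothetical
routine that processes one packet):
.Bd -literal
    while (rxr->head != rxr->tail) {   /* slots to process? */
        struct netmap_slot *slot = &rxr->slot[rxr->head];
        char *p = NETMAP_BUF(rxr, slot->buf_idx);

        consume(p, slot->len);         /* process the packet */
        /* return the slot to the kernel */
        rxr->head = rxr->cur = nm_ring_next(rxr, rxr->head);
    }
.Ed
The slots just released are effectively returned to the kernel at the
next NIOCRXSYNC/select()/poll().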
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
MUST be set when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detecting
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reducing data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
A sketch of transmitting a multi-fragment packet is shown below.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
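.Pp
A minimal sketch of queueing one packet split across
.Va nfrags
slots on a transmit ring (assuming enough slots are available,
and hypothetical frag[] descriptors holding the data):
.Bd -literal
    uint32_t i = txr->head;

    for (int f = 0; f < nfrags; f++) {
        struct netmap_slot *slot = &txr->slot[i];
        char *p = NETMAP_BUF(txr, slot->buf_idx);

        memcpy(p, frag[f].data, frag[f].len);
        slot->len = frag[f].len;   /* length of this fragment only */
        /* all but the last slot carry NS_MOREFRAG */
        slot->flags = (f < nfrags - 1) ? NS_MOREFRAG : 0;
        i = nm_ring_next(txr, i);
    }
    txr->head = txr->cur = i;      /* expose the whole chain */
.Ed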
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of i.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or the ring is full.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission (see the sketch below
this list).
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the NETMAP_NO_TX_POLL flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the NETMAP_DO_RX_POLL flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open_live 3 ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources;
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
(see the sketch below this list);
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
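.Pp
A minimal sketch of a callback-based receiver built on these
functions (print_cb is just an example name):
.Bd -literal
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

static void
print_cb(u_char *arg, const struct nm_pkthdr *h, const u_char *buf)
{
    printf("got a packet of %u bytes\en", h->len);
}

int main(void)
{
    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
    struct pollfd fds = { .fd = NETMAP_FD(d), .events = POLLIN };

    for (;;) {
        poll(&fds, 1, -1);
        /* a negative count processes all pending packets,
         * as with pcap_dispatch() */
        nm_dispatch(d, -1, print_cb, NULL);
    }
    nm_close(d);    /* not reached */
    return 0;
}
.Ed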
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and numbers of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
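.Pp
On FreeBSD these can be inspected and changed with
.Xr sysctl 8 ,
e.g.
.Dl sysctl dev.netmap.buf_num=262144
(the value is just an example), while on Linux the corresponding
files under
.Pa /sys/module/netmap_lin/parameters/
can be read and written.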
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or the
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    int fd, i;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up into the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp. 45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.