.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 3, 2020
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr kqueue 2
and
.Xr epoll 7 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes who own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface )
.Bd -literal
struct netmap_if {
    ...
    const uint32_t    ni_flags;     /* properties              */
    ...
    const uint32_t    ni_tx_rings;  /* NIC tx rings            */
    const uint32_t    ni_rx_rings;  /* NIC rx rings            */
    uint32_t          ni_bufs_head; /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
The number of extra
buffers is specified in the
.Va arg.nr_arg3
field.
On success, the kernel writes back to
.Va arg.nr_arg3
the number of extra buffers actually allocated (there may be fewer
than requested if the memory space ran out of buffers).
.Pa ni_bufs_head
contains the index of the first of these extra buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
The application is free to modify
this list and use the buffers (i.e., binding them to the slots of a
netmap ring).
When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
.Em NIOCREGIF .
.It Dv struct netmap_ring (one per ring )
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t    num_slots;    /* slots in each ring            */
    const uint32_t    nr_buf_size;  /* size of each buffer           */
    ...
    uint32_t          head;         /* (u) first buf owned by user   */
    uint32_t          cur;          /* (u) wakeup position           */
    const uint32_t    tail;         /* (k) first buf owned by kernel */
    ...
    uint32_t          flags;
    struct timeval    ts;           /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0];     /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer )
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index                 */
    uint16_t len;       /* packet length                */
    uint16_t flags;     /* buf changed, etc.            */
    uint64_t ptr;       /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
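.Pp
The index rules for a ring that is not a power of two in size can be
illustrated with a minimal sketch.
This is not the
.Nm
implementation: the type
.Vt struct toy_ring
and the functions
.Fn toy_ring_next
and
.Fn toy_ring_avail
are hypothetical stand-ins that model only the three indexes and the ring size.
Wrap-around uses a comparison rather than a bit mask, and the available
range is taken to start at
.Va head ,
which is an assumption of this sketch.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of a netmap ring. */
struct toy_ring {
    uint32_t num_slots; /* ring size, not necessarily a power of two */
    uint32_t head;      /* first slot owned by userspace */
    uint32_t cur;       /* wakeup position */
    uint32_t tail;      /* first slot owned by the kernel */
};

/* Next index after i, wrapping with a comparison instead of a mask. */
static uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
    assert(r->num_slots > 0 && i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the circular range head ... tail-1. */
static uint32_t
toy_ring_avail(const struct toy_ring *r)
{
    if (r->tail >= r->head)
        return r->tail - r->head;
    return r->tail + r->num_slots - r->head;
}
```
.Ed
.Pp
With a 6-slot ring,
.Va head
at 4 and
.Va tail
at 2, the sketch reports four usable slots (4, 5, 0 and 1).
Because one slot is always kept empty, the result never exceeds
the ring size minus one.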
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
              head cur   tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
              head cur         tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
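.Pp
The relation between a buffer index and a buffer address can be sketched
as follows.
This is a simplified model under stated assumptions, not the layout used by
.Nm :
.Vt struct toy_pool
and its fields
.Va base ,
.Va buf_size
and
.Va num_bufs
are hypothetical names, and real programs should use the
.Dv NETMAP_BUF
macro instead.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical view of a pool of fixed-size buffers.
 * These names are illustrative, not fields of the netmap structures.
 */
struct toy_pool {
    char *base;         /* start of the buffer area */
    uint32_t buf_size;  /* size of each buffer, e.g. 2048 */
    uint32_t num_bufs;  /* number of buffers in the pool */
};

/* Address of buffer idx: fixed-size buffers are a plain array. */
static char *
toy_buf(const struct toy_pool *p, uint32_t idx)
{
    assert(idx < p->num_bufs);
    return p->base + (size_t)idx * p->buf_size;
}

/* A packet can be stored in one slot only if it fits in one buffer. */
static int
toy_fits(const struct toy_pool *p, size_t pkt_len)
{
    return pkt_len <= p->buf_size;
}
```
.Ed
.Pp
Under these assumptions a 1514-byte frame fits in a 2048-byte buffer,
while a 4096-byte packet does not and would need more than one slot.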
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode,
packets marked with this flag by the user application are forwarded to the
other endpoint at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
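.Pp
Since no field holds the total length, a reader has to add up the
fragments itself.
The following is a minimal sketch of that walk, not code from
.Nm :
.Vt struct toy_slot ,
.Dv TOY_MOREFRAG
and
.Fn toy_pkt_len
are hypothetical stand-ins, and the real flag value must be taken from the
.Nm
headers.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for NS_MOREFRAG; the value here is an assumption. */
#define TOY_MOREFRAG 0x0020

/* Hypothetical stand-in for the len and flags fields of a slot. */
struct toy_slot {
    uint16_t len;   /* length of this fragment only */
    uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot first.
 * avail is the number of valid slots from first onwards.
 * Returns the total length and stores the slot count in *nfrags,
 * or returns 0 with *nfrags == 0 if the last fragment is missing.
 */
static uint32_t
toy_pkt_len(const struct toy_slot *slots, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nfrags)
{
    uint32_t total = 0, i = first, n;

    assert(num_slots > 0 && first < num_slots);
    for (n = 1; n <= avail; n++) {
        total += slots[i].len;
        if ((slots[i].flags & TOY_MOREFRAG) == 0) {
            *nfrags = n;    /* flag clear: last fragment */
            return total;
        }
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nfrags = 0;            /* chain not complete yet */
    return 0;
}
```
.Ed
.Pp
A caller would advance
.Va head
and
.Va cur
by the returned slot count only once the whole chain is present.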
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use function
.Va nm_open(..)
(see
.Sx LIBRARIES )
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when
.Va select() /
.Va poll()
are called with a write event (POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Nm VALE
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 7
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on
.Xr epoll 7
and
.Xr kqueue 2 ,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
similar to
.Xr pcap_open_live 3 ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d )
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
similar to
.Va pcap_inject() ,
pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
similar to
.Va pcap_dispatch() ,
applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
similar to
.Va pcap_next() ,
fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On
.Fx :
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr iflib 4
.Pq providing Xr igb 4 and Xr em 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr re 4 ,
.Xr vtnet 4 .
.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that in case the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected to the host stack as they
were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
and
.Nm VALE
are controlled through sysctl variables on
.Fx
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_rings: 1
Number of rings used for emulated netmap mode
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.txsync_retry: 2
Number of txsync loops in the
.Nm VALE
flush function
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.priv_buf_num: 4098
.It Va dev.netmap.priv_buf_size: 2048
.It Va dev.netmap.priv_ring_num: 4
.It Va dev.netmap.priv_ring_size: 20480
.It Va dev.netmap.priv_if_num: 2
.It Va dev.netmap.priv_if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for private memory regions.
A separate memory region is used for each
.Nm VALE
port and each pair of
.Nm netmap pipes .
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.It Va dev.netmap.ptnet_vnet_hdr: 1
Allow ptnet devices to use virtio-net headers
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 7
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge 4
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i netmap:ix0 -i netmap:ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i netmap:ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);    /* put ix0 in netmap mode */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);         /* wait for free TX slots */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Pp
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);         /* wait for received packets */
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Pp
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];    /* slot holding the received packet */
    dst = &txr->slot[txr->cur];    /* next free transmit slot */
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ...);
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl valectl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
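.Ss MODELLING THE BUFFER SWAP
The slot exchange described under ZERO-COPY FORWARDING can be tried out
without a netmap device using the sketch below.
It is not part of the
.Nm
API: the
.Vt toy_slot
and
.Vt toy_ring
types, the
.Dv TOY_BUF_CHANGED
flag and both functions are hypothetical stand-ins that keep only
the fields the swap touches, so the code needs nothing beyond the C library.
.Pp
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag playing the role of NS_BUF_CHANGED. */
#define TOY_BUF_CHANGED 0x0001

/* Toy stand-in for a ring slot: buffer index, length, flags. */
struct toy_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/* Toy stand-in for a ring with a small, fixed number of slots. */
struct toy_ring {
    uint32_t num_slots;
    uint32_t head;
    uint32_t cur;
    struct toy_slot slot[4];
};

/* Advance an index by one slot, wrapping at the end of the ring. */
uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots ? 0 : i + 1);
}

/*
 * Forward one packet from rxr to txr by exchanging buffer indices.
 * The payload is never copied; each ring ends up owning the buffer
 * that previously belonged to the other.
 */
void
toy_forward_one(struct toy_ring *rxr, struct toy_ring *txr)
{
    struct toy_slot *src = &rxr->slot[rxr->cur];
    struct toy_slot *dst = &txr->slot[txr->cur];
    uint32_t tmp = dst->buf_idx;

    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = TOY_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = TOY_BUF_CHANGED;
    rxr->head = rxr->cur = toy_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = toy_ring_next(txr, txr->cur);
}
```
.Ed
.Pp
In this model, forwarding from a receive ring whose
.Va cur
points at the last slot wraps
.Va cur
and
.Va head
back to 0, while the transmit ring advances by one slot.
Only the two buffer indices change hands.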
.Sh SEE ALSO
.Xr vale 4 ,
.Xr valectl 8 ,
.Xr bridge 8 ,
.Xr lb 8 ,
.Xr nmreplay 8 ,
.Xr pkt-gen 8
.Pp
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.