.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.Dd October 10, 2024
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr kqueue 2
and
.Xr epoll 7 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes who own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface )
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
The number of extra
buffers is specified in the
.Va arg.nr_arg3
field.
On success, the kernel writes back to
.Va arg.nr_arg3
the number of extra buffers actually allocated (this may be less
than the number requested if the memory space ran out of buffers).
.Pa ni_bufs_head
contains the index of the first of these extra buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
The application is free to modify
this list and use the buffers (i.e., binding them to the slots of a
netmap ring).
When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
.Em NIOCREGIF .
.It Dv struct netmap_ring (one per ring )
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t   num_slots;     /* slots in each ring */
    const uint32_t   nr_buf_size;   /* size of each buffer */
    ...
    uint32_t         head;          /* (u) first buf owned by user */
    uint32_t         cur;           /* (u) wakeup position */
    const uint32_t   tail;          /* (k) first buf owned by kernel */
    ...
    uint32_t         flags;
    struct timeval   ts;            /* (k) time of last rxsync() */
    ...
    struct netmap_slot slot[0];     /* array of slots */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer )
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* buf changed, etc. */
    uint64_t ptr;       /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
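.Pp
Because the ring size need not be a power of two,
index arithmetic has to wrap explicitly instead of masking bits.
The following is a minimal sketch of that arithmetic, not the
.Nm
implementation:
.Vt struct demo_ring ,
.Fn demo_ring_next
and
.Fn demo_ring_space
are hypothetical stand-ins that only mirror the
.Va num_slots , head , cur
and
.Va tail
fields of a ring.
.Bd -literal
#include <stdint.h>

/* Simplified stand-in for the index fields of struct netmap_ring. */
struct demo_ring {
    uint32_t num_slots; /* ring size, not necessarily a power of two */
    uint32_t head;      /* first slot owned by userspace */
    uint32_t cur;       /* wakeup position */
    uint32_t tail;      /* first slot owned by the kernel */
};

/* Next index with explicit wrap-around (no bit masking). */
static uint32_t
demo_ring_next(const struct demo_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range cur ... tail-1. */
static uint32_t
demo_ring_space(const struct demo_ring *r)
{
    return (r->tail >= r->cur) ? r->tail - r->cur
        : r->tail + r->num_slots - r->cur;
}
.Ed
.Pp
With a ring of 6 slots,
.Va cur
at 4 and
.Va tail
at 2, the range wraps around the end of the ring and covers
slots 4, 5, 0 and 1.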
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
   after the syscall, slots between cur and tail are (a)vailable
           head=cur      tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

   user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

   NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
   after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

   user advances head and cur, releasing some slots and holding others
              head cur   tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

   NIOCRXSYNC/poll()/select() recovers slots and reports new packets
              head cur         tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
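.Pp
As an illustration of how slots, buffer indexes and lengths fit together,
the sketch below drains a receive ring by walking the slots in
.Va head\& . . . tail-1
and looking up each packet through the
.Va buf_idx
of its slot.
It is a self-contained model under stated assumptions, not code from
.Nm :
.Vt struct demo_slot ,
.Vt struct demo_rx_ring ,
.Fn demo_rx_drain
and the DEMO_ constants are hypothetical,
and a plain array stands in for the mmapped buffer area.
.Bd -literal
#include <stdint.h>

#define DEMO_SLOTS  5       /* illustrative ring size */
#define DEMO_BUFSZ  2048    /* the usual buffer size */

struct demo_slot {
    uint32_t buf_idx;   /* index of the buffer holding the packet */
    uint16_t len;       /* packet length */
};

struct demo_rx_ring {
    uint32_t head, cur, tail;
    struct demo_slot slot[DEMO_SLOTS];
};

/*
 * Consume every slot in head ... tail-1 and return the bytes seen.
 * bufs stands in for the mmapped buffer area.
 */
static uint32_t
demo_rx_drain(struct demo_rx_ring *r, char (*bufs)[DEMO_BUFSZ])
{
    uint32_t bytes = 0;

    while (r->head != r->tail) {
        struct demo_slot *s = &r->slot[r->head];
        const char *pkt = bufs[s->buf_idx]; /* one packet, one buffer */

        (void)pkt;      /* a real program would parse the packet here */
        bytes += s->len;
        /* mark the slot as returned to the kernel at the next sync */
        r->head = r->cur =
            (r->head + 1 == DEMO_SLOTS) ? 0 : r->head + 1;
    }
    return bytes;
}
.Ed
.Pp
A real client would operate on
.Vt struct netmap_ring
and obtain the buffer address with
.Fn NETMAP_BUF
instead of indexing a local array.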
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode,
packets marked with this flag by the user application are forwarded to the
other endpoint at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
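.Pp
Since no field holds the total packet length,
a client that needs it has to add up the fragments itself.
The following is a minimal sketch of that loop under stated assumptions:
.Vt struct demo_frag_slot
and
.Fn demo_pkt_len
are hypothetical names, and
.Dv DEMO_NS_MOREFRAG
is a local stand-in for the real
.Dv NS_MOREFRAG
flag, which should be taken from the
.Nm
headers.
.Bd -literal
#include <stdint.h>

/* Stand-in flag value; real code uses NS_MOREFRAG from netmap.h. */
#define DEMO_NS_MOREFRAG 0x0020

struct demo_frag_slot {
    uint16_t len;       /* length of this fragment only */
    uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot i.
 * Stores the number of slots used by the packet in *nslots.
 */
static uint32_t
demo_pkt_len(const struct demo_frag_slot *slot, uint32_t num_slots,
    uint32_t i, uint32_t *nslots)
{
    uint32_t total = 0, n = 0;

    while (n < num_slots) {     /* never loop past one full ring */
        const struct demo_frag_slot *s = &slot[i];

        total += s->len;
        n++;
        if (!(s->flags & DEMO_NS_MOREFRAG))
            break;              /* last fragment has the flag clear */
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nslots = n;
    return total;
}
.Ed
.Pp
The sketch only bounds the walk by the ring size;
a real client should also stop at
.Va tail ,
since slots beyond it belong to the kernel.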
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use function
.Va nm_open(..)
(see
.Sx LIBRARIES )
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo*"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when
.Va select() /
.Va poll()
are called with a write event (POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created to a
.Nm VALE
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 7
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on
.Xr epoll 7
and
.Xr kqueue 2 ,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
similar to
.Xr pcap_open_live 3 ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d )
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
similar to
.Va pcap_inject() ,
pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
similar to
.Va pcap_dispatch() ,
applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
similar to
.Va pcap_next() ,
fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On
.Fx :
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr iflib 4
.Pq providing Xr igb 4 and Xr em 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr re 4 ,
.Xr vtnet 4 .
.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that in case the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected to the host stack as they
were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
and
.Nm VALE
are controlled through sysctl variables on
.Fx
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_rings: 1
Number of rings used for emulated netmap mode
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.txsync_retry: 2
Number of txsync loops in the
.Nm VALE
flush function
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.priv_buf_num: 4098
.It Va dev.netmap.priv_buf_size: 2048
.It Va dev.netmap.priv_ring_num: 4
.It Va dev.netmap.priv_ring_size: 20480
.It Va dev.netmap.priv_if_num: 2
.It Va dev.netmap.priv_if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for private memory regions.
A separate memory region is used for each
.Nm VALE
port and each pair of
.Nm netmap pipes .
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.It Va dev.netmap.max_bridges: 8
Max number of
.Nm VALE
switches that can be created.
This tunable can be specified at loader time.
.It Va dev.netmap.ptnet_vnet_hdr: 1
Allow ptnet devices to use virtio-net headers
.It Va dev.netmap.port_numa_affinity: 0
On
.Xr numa 4
systems, allocate memory for netmap ports from the local NUMA domain when
possible.
This can improve performance by reducing the number of remote memory accesses.
However, when forwarding packets between ports attached to different NUMA
domains, this will prevent zero-copy forwarding optimizations and thus may hurt
performance.
Note that this setting must be specified as a loader tunable at boot time.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 7
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge 8
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i netmap:ix0 -i netmap:ix1
or even to connect the NIC to the host stack using netmap
.Dl bridge -i netmap:ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *buf;
    void *p;
    uint32_t i;
    int fd;

    /* error checks omitted for brevity */
    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    /* map the shared region that holds rings and buffers */
    p = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        /* fill every free slot of the transmit ring */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Pp
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Pp
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    /* swap the buffers of the two slots */
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    /* tell the kernel that the buffer indexes have changed */
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl valectl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Xr vale 4 ,
.Xr bridge 8 ,
.Xr lb 8 ,
.Xr nmreplay 8 ,
.Xr pkt-gen 8 ,
.Xr valectl 8
.Pp
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
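.Pp
Since checksum offloading is not used, a client that builds its own
packets typically has to fill in the checksum fields in software.
The following is a minimal sketch of the Internet checksum (RFC 1071),
not code taken from
.Nm ;
the helper name
.Fn inet_cksum
is hypothetical and not part of the
.Nm
API:
.Pp
.Bd -literal -compact
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption, not part of netmap):
 * RFC 1071 Internet checksum over len bytes at data.
 * The 32-bit accumulator is ample for packet-sized input.
 */
uint16_t
inet_cksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    /* add up the data as 16-bit big-endian words */
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    /* a trailing odd byte is padded with a zero low byte */
    if (len == 1)
        sum += (uint32_t)p[0] << 8;
    /* fold the carries back into the low 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    /* the checksum is the one's complement of the sum */
    return (uint16_t)~sum;
}
.Ed
.Pp
The value is computed with the checksum field set to zero and is then
stored in that field high byte first.
Running the function over a header that already carries a valid
checksum returns 0.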