1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
.Dd December 14, 2015
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.br
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic.
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.br
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
may not be supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
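.Pp
The complete sequence is sketched below (a minimal example:
the port name "em0" and the error handling are illustrative only):
.Bd -literal
    struct nmreq nmr;
    void *mem;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    if (fd < 0)
        err(1, "open");
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "em0");     /* bind to NIC em0 */
    nmr.nr_version = NETMAP_API;
    if (ioctl(fd, NIOCREGIF, &nmr) < 0)
        err(1, "NIOCREGIF");
    mem = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        err(1, "mmap");
    ...
    close(fd);                      /* back to normal mode */
.Ed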
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list; a sketch of walking this list
is shown after the list of structures below.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
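.Pp
The following sketch (illustrative only; any mapped ring of the
interface can be passed to NETMAP_BUF() to locate a buffer) walks
the list of extra buffers requested through
.Em NIOCREGIF :
.Bd -literal
    uint32_t next, scan;
    char *buf;

    /* ni_bufs_head holds the index of the first extra buffer;
     * the first uint32_t in each buffer is the index of the
     * next one, and 0 terminates the list.
     */
    for (scan = nifp->ni_bufs_head; scan != 0; scan = next) {
        buf = NETMAP_BUF(ring, scan);
        next = *(uint32_t *)buf;
        ... use buf as temporary packet storage ...
    }
.Ed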
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
which are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
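.Pp
For instance, a transmitter may fill every available slot before
issuing a single system call (a sketch;
.Va build_packet()
is a hypothetical helper that writes a packet and returns its length):
.Bd -literal
    while (!nm_ring_empty(ring)) {
        uint32_t i = ring->cur;
        char *buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);

        ring->slot[i].len = build_packet(buf, ring->nr_buf_size);
        ring->head = ring->cur = nm_ring_next(ring, i);
    }
    ioctl(fd, NIOCTXSYNC, NULL); /* push the whole batch */
.Ed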
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
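.Pp
A receiver typically drains all reported slots and returns them
in one batch (a sketch;
.Va consume_pkt()
is a hypothetical helper):
.Bd -literal
    ioctl(fd, NIOCRXSYNC, NULL); /* or poll()/select() */
    while (!nm_ring_empty(ring)) {
        uint32_t i = ring->cur;
        struct netmap_slot *slot = &ring->slot[i];

        consume_pkt(NETMAP_BUF(ring, slot->buf_idx), slot->len);
        ring->head = ring->cur = nm_ring_next(ring, i);
    }
.Ed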
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no field carrying the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
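.Pp
The sketch below (illustrative only;
.Va frag_len[]
holds the length of each fragment) transmits a packet split across
.Va nfrags
consecutive slots, setting NS_MOREFRAG on all but the last one:
.Bd -literal
    uint32_t j, i;

    for (j = 0; j < nfrags; j++) {
        i = ring->cur;
        ... copy fragment j into NETMAP_BUF(ring, ring->slot[i].buf_idx) ...
        ring->slot[i].len = frag_len[j];
        ring->slot[i].flags = (j < nfrags - 1) ? NS_MOREFRAG : 0;
        ring->cur = nm_ring_next(ring, i);
    }
    ring->head = ring->cur; /* release the whole chain at once */
.Ed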
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
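.Pp
For example, a port can be queried as follows (a minimal sketch,
without error handling; "em0" is an arbitrary port name):
.Bd -literal
    struct nmreq nmr;

    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "em0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCGINFO, &nmr);
    printf("%s: %u tx rings of %u slots, memsize %u\en",
        nmr.nr_name, nmr.nr_tx_rings, nmr.nr_tx_slots,
        nmr.nr_memsize);
.Ed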
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number and size of the rings may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use the function
.Va nm_open()
(see
.Sx LIBRARIES ) ,
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
The
.Va nr_flags
and
.Va nr_ringid
fields select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC                         "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW            "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW        "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC       "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER  "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission
(see the sketch after this list).
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
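.Pp
As an example, a busy-waiting transmitter may skip
.Xr poll 2
entirely and rely on
.Dv NIOCTXSYNC
alone (a sketch;
.Va fill_ring()
is a hypothetical helper that fills free slots and advances
.Va head
and
.Va cur ) :
.Bd -literal
    for (;;) {
        fill_ring(txring);           /* prepare new packets */
        ioctl(fd, NIOCTXSYNC, NULL); /* push them out, reclaim slots */
    }
.Ed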
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3 (uses the fields from arg);
.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
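.Pp
A minimal capture loop based on these helpers may look as follows
(a sketch;
.Va my_cb
is a user-supplied callback and "netmap:em0" an arbitrary port):
.Bd -literal
    static void
    my_cb(u_char *arg, const struct nm_pkthdr *h, const u_char *buf)
    {
        ... process the packet: buf, h->len bytes ...
    }

    ...
    struct pollfd fds;
    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, 0);

    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        nm_dispatch(d, -1, my_cb, NULL); /* -1: all pending packets */
    }
.Ed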
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr i40e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or some 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or equal throughput.
.br
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the
implementation of emulation.
Note that if the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected into the host stack as if
they had been received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.br
0 uses the best available option;
.br
1 forces native mode and fails if not available;
.br
2 forces emulated mode, hence it never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and the number of send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ...) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
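.Pp
For instance, packets that the host stack is sending through em0
can be read as follows (a sketch reusing the helper functions from
.Sx LIBRARIES ) :
.Bd -literal
    struct nm_pkthdr h;
    struct nm_desc *d = nm_open("netmap:em0^", NULL, 0, 0);
    u_char *buf;

    while ( (buf = nm_nextpkt(d, &h)) )
        ... packet from the host stack: buf, h.len bytes ...
.Ed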
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.