1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
.Dd February 13, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.br
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm , VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API. The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A 0 indicates the end of the list.
.Pp
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
}
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.Pp
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in <net/netmap_user.h>
help converting them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Pp
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.Pp
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely. This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no field containing the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O. They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Pp
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region. NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor. For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
The
.Va nr_flags
and
.Va nr_ringid
fields select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Pp
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC                         "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW            "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW        "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC       "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER  "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential. On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of i.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() is called with a write event
(POLLOUT/wfdset) or the ring is full.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE.
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events. Passing the NETMAP_NO_TX_POLL flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested. Passing the NETMAP_DO_RX_POLL flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port. These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nm_flags and nm_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
opened file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3 (uses the fields from arg);
.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation. Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Pp
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region. The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch. Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes. Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Pp
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or the
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports. It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator.
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
	poll(&fds, 1, -1);
	while (!nm_ring_empty(ring)) {
	    i = ring->cur;
	    buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
	    ... prepare packet in buf ...
	    ring->slot[i].len = ... packet length ...
	    ring->head = ring->cur = nm_ring_next(ring, i);
	}
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
	poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
	    consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers. The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).