.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd December 14, 2015
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Pp
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on
.Fx
and Linux, and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm ,
.Nm VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
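.Pp
The calls above combine as in the following minimal sketch
(error handling is omitted, the helper name
.Fn open_port
and the port name "ix0" are arbitrary examples):
.Bd -literal
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <string.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

static struct netmap_if *
open_port(const char *name, int *fdp)
{
	struct nmreq nmr;
	void *mem;
	int fd;

	fd = open("/dev/netmap", O_RDWR);
	memset(&nmr, 0, sizeof(nmr));
	strlcpy(nmr.nr_name, name, sizeof(nmr.nr_name));
	nmr.nr_version = NETMAP_API;
	ioctl(fd, NIOCREGIF, &nmr);	/* bind fd to the port */
	mem = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	*fdp = fd;
	return NETMAP_IF(mem, nmr.nr_offset);	/* see DATA STRUCTURES */
}
.Ed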
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
commands, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up into a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
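.Pp
As an illustration, the following sketch uses these macros to walk the
list of extra buffers announced in
.Pa ni_bufs_head
(it assumes
.Va nifp
was obtained as shown above; any ring of the port can be passed to
NETMAP_BUF to resolve a buffer index):
.Bd -literal
	struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
	uint32_t scan;

	for (scan = nifp->ni_bufs_head; scan != 0; ) {
		char *buf = NETMAP_BUF(ring, scan);
		uint32_t next = *(uint32_t *)buf; /* 0 ends the list */

		/* ... buf can be used as temporary packet storage ... */
		scan = next;
	}
.Ed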
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
which are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
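.Pp
For instance, an application that wants all queued packets on the wire
before closing the file descriptor can drain the ring with a loop such
as the following (a sketch only; it assumes
.Va fd
and
.Va nifp
from the earlier examples, and omits timeouts and error handling):
.Bd -literal
	struct netmap_ring *txr = NETMAP_TXRING(nifp, 0);

	while (nm_tx_pending(txr)) {
		ioctl(fd, NIOCTXSYNC, NULL);	/* flush and reclaim slots */
		usleep(1000);			/* give the NIC some time */
	}
.Ed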
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
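.Pp
A minimal receive loop over a single hardware ring looks as follows
(a sketch; it assumes
.Va fd
and
.Va nifp
from the earlier examples, and leaves the payload handling as a comment):
.Bd -literal
	struct netmap_ring *rxr = NETMAP_RXRING(nifp, 0);
	uint32_t i;

	ioctl(fd, NIOCRXSYNC, NULL);		/* or poll()/select() */
	while (!nm_ring_empty(rxr)) {
		i = rxr->cur;
		char *buf = NETMAP_BUF(rxr, rxr->slot[i].buf_idx);

		/* ... process rxr->slot[i].len bytes at buf ... */
		rxr->head = rxr->cur = nm_ring_next(rxr, i);
	}
.Ed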
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
requests a notification when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; no field carries the total length of the packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
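.Pp
The sketch below queues a packet split across two slots of a transmit
ring (a non-normative example:
.Va ring ,
.Va len1
and
.Va len2
are placeholders, the ring is assumed to have at least two free slots,
and the payload preparation is omitted):
.Bd -literal
	uint32_t i = ring->cur;

	ring->slot[i].len = len1;		/* first fragment */
	ring->slot[i].flags = NS_MOREFRAG;	/* more slots follow */
	i = nm_ring_next(ring, i);
	ring->slot[i].len = len2;		/* last fragment */
	ring->slot[i].flags = 0;		/* end of the chain */
	ring->head = ring->cur = nm_ring_next(ring, i);
.Ed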
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number of rings and their sizes may be configured at runtime
using interface-specific tools such as
.Xr ethtool 8 .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
The
.Va nr_flags
and
.Va nr_ringid
fields select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name;
a minimal binding sketch is shown after this list.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC                         "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW            "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW        "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC       "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER  "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
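.Pp
As referenced above, the following sketch binds a file descriptor to a
single hardware ring pair of a port (the port name "ix0" and ring
number 2 are arbitrary examples; error handling is omitted):
.Bd -literal
	struct nmreq nmr;
	int fd = open("/dev/netmap", O_RDWR);

	memset(&nmr, 0, sizeof(nmr));
	strlcpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name));
	nmr.nr_version = NETMAP_API;
	nmr.nr_flags = NR_REG_ONE_NIC;	/* one hardware ring pair only */
	nmr.nr_ringid = 2;		/* ... the third one */
	ioctl(fd, NIOCREGIF, &nmr);
.Ed
With the helper library described in
.Sx LIBRARIES ,
the same binding can be obtained with
.Dl nm_open("netmap:ix0-2", NULL, 0, NULL);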
.Sh SELECT, POLL, EPOLL, KQUEUE.
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3 (uses the fields from arg);
.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
(see the sketch after this list).
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
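.Pp
A short sketch of the callback-based interface follows (the function
names
.Fn handler
and
.Fn rx_loop
and the port "netmap:ix0" are arbitrary examples; error handling is
omitted):
.Bd -literal
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

static void
handler(u_char *arg, const struct nm_pkthdr *h, const u_char *buf)
{
	/* ... process h->len bytes starting at buf ... */
}

void
rx_loop(void)
{
	struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
	struct pollfd fds = { .fd = NETMAP_FD(d), .events = POLLIN };

	for (;;) {
		poll(&fds, 1, -1);
		nm_dispatch(d, -1, handler, NULL); /* all pending packets */
	}
}
.Ed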
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
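.Pp
For example, on
.Fx
a thread can pin itself to a core as follows (a sketch; the helper name
.Fn pin_to_core
is an arbitrary example and error checking is omitted):
.Bd -literal
#include <sys/param.h>
#include <sys/cpuset.h>
#include <pthread.h>
#include <pthread_np.h>

void
pin_to_core(int core)
{
	cpuset_t cs;

	CPU_ZERO(&cs);
	CPU_SET(core, &cs);
	pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);
}
.Ed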
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
	poll(&fds, 1, -1);
	while (!nm_ring_empty(ring)) {
	    i = ring->cur;
	    buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
	    ... prepare packet in buf ...
	    ring->slot[i].len = ... packet length ...
	    ring->head = ring->cur = nm_ring_next(ring, i);
	}
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
	poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
	    consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up into the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
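.Pp
For example, a packet can be injected into the host stack through the
host rings with the helper functions (a sketch; the port "ix0" and the
all-zero payload are placeholders, and error handling is omitted):
.Bd -literal
	struct nm_desc *host = nm_open("netmap:ix0^", NULL, 0, NULL);
	char pkt[60] = { 0 };	/* ... fill in a valid frame ... */

	nm_inject(host, pkt, sizeof(pkt));	/* queue towards the host */
	ioctl(NETMAP_FD(host), NIOCTXSYNC, NULL);
.Ed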
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
.Pp
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
1099