1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
30.Dd February 13, 2014
31.Dt NETMAP 4
32.Os
33.Sh NAME
34.Nm netmap
35.Nd a framework for fast packet I/O
36.br
37.Nm VALE
38.Nd a fast VirtuAl Local Ethernet using the netmap API
39.br
40.Nm netmap pipes
41.Nd a shared memory packet transport channel
42.Sh SYNOPSIS
43.Cd device netmap
44.Sh DESCRIPTION
45.Nm
46is a framework for extremely fast and efficient packet I/O
47for both userspace and kernel clients.
48It runs on FreeBSD and Linux,
49and includes
50.Nm VALE ,
51a very fast and modular in-kernel software switch/dataplane,
52and
53.Nm netmap pipes ,
54a shared memory packet transport channel.
55All these are accessed interchangeably with the same API.
56.Pp
57.Nm , VALE
58and
59.Nm netmap pipes
60are at least one order of magnitude faster than
61standard OS mechanisms
62(sockets, bpf, tun/tap interfaces, native switches, pipes),
63reaching 14.88 million packets per second (Mpps)
64with much less than one core on a 10 Gbit NIC,
65about 20 Mpps per core for VALE ports,
66and over 100 Mpps for netmap pipes.
67.Pp
68Userspace clients can dynamically switch NICs into
69.Nm
70mode and send and receive raw packets through
71memory mapped buffers.
72Similarly,
73.Nm VALE
74switch instances and ports, and
75.Nm netmap pipes
76can be created dynamically,
77providing high speed packet I/O between processes,
78virtual machines, NICs and the host stack.
79.Pp
80.Nm
supports both non-blocking I/O through
.Xr ioctl 2
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2 .
89.Nm VALE
90and
91.Nm netmap pipes
92are implemented by a single kernel module, which also emulates the
93.Nm
94API over standard drivers for devices without native
95.Nm
96support.
97For best performance,
98.Nm
99requires explicit support in device drivers.
100.Pp
101In the rest of this (long) manual page we document
102various aspects of the
103.Nm
104and
105.Nm VALE
106architecture, features and usage.
107.Pp
108.Sh ARCHITECTURE
109.Nm
110supports raw packet I/O through a
111.Em port ,
112which can be connected to a physical interface
113.Em ( NIC ) ,
114to the host stack,
115or to a
116.Nm VALE
switch.
118Ports use preallocated circular queues of buffers
119.Em ( rings )
120residing in an mmapped region.
121There is one ring for each transmit/receive queue of a
122NIC or virtual port.
123An additional ring pair connects to the host stack.
124.Pp
125After binding a file descriptor to a port, a
126.Nm
127client can send or receive packets in batches through
128the rings, and possibly implement zero-copy forwarding
129between ports.
130.Pp
131All NICs operating in
132.Nm
133mode use the same memory region,
134accessible to all processes who own
135.Nm /dev/netmap
136file descriptors bound to NICs.
137Independent
138.Nm VALE
139and
140.Nm netmap pipe
141ports
142by default use separate memory regions,
143but can be independently configured to share memory.
144.Pp
145.Sh ENTERING AND EXITING NETMAP MODE
146The following section describes the system calls to create
147and control
148.Nm netmap
149ports (including
150.Nm VALE
151and
152.Nm netmap pipe
153ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
156.Pp
157Ports and rings are created and controlled through a file descriptor,
158created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
160and then bound to a specific port with an
161.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
162.Pp
163.Nm
164has multiple modes of operation controlled by the
165.Vt struct nmreq
166argument.
167.Va arg.nr_name
168specifies the port name, as follows:
169.Bl -tag -width XXXX
170.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
171the data path of the NIC is disconnected from the host stack,
172and the file descriptor is bound to the NIC (one or all queues),
173or to the host stack;
174.It Dv valeXXX:YYY (arbitrary XXX and YYY)
175the file descriptor is bound to port YYY of a VALE switch called XXX,
176both dynamically created if necessary.
177The string cannot exceed IFNAMSIZ characters, and YYY cannot
178be the name of any existing OS network interface.
179.El
180.Pp
181On return,
182.Va arg
183indicates the size of the shared memory region,
184and the number, size and location of all the
185.Nm
186data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
188.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
195.Xr epoll 2
196and
197.Xr kqueue 2
198are not supported on
199.Nm
200file descriptors.
201.Pp
202While a NIC is in
203.Nm
204mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
206.Nm
207ring, and another ring is used to send packets into the OS network stack.
208A
209.Xr close 2
210on the file descriptor removes the binding,
211and returns the NIC to normal mode (reconnecting the data path
212to the host stack), or destroys the virtual port.
213.Pp
214.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
220.Bl -tag -width XXX
221.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
233.Pp
234Indicates the number of available rings
.Pa ( struct netmap_ring )
236and their position in the mmapped region.
237The number of tx and rx rings
238.Pa ( ni_tx_rings , ni_rx_rings )
239normally depends on the hardware.
240NICs also have an extra tx/rx ring pair connected to the host stack.
241.Em NIOCREGIF
242can also request additional unbound buffers in the same memory space,
243to be used as temporary storage for packets.
244.Pa ni_bufs_head
contains the index of the first of these free buffers,
246which are connected in a list (the first uint32_t of each
247buffer being the index of the next buffer in the list).
248A 0 indicates the end of the list.
249.Pp
250.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
267.Pp
268Implements transmit and receive rings, with read/write
pointers, metadata and an array of
270.Pa slots
271describing the buffers.
272.Pp
273.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
282.Pp
283Describes a packet buffer, which normally is identified by
284an index and resides in the mmapped region.
285.It Dv packet buffers
286Fixed size (normally 2 KB) packet buffers allocated by the kernel.
287.El
288.Pp
289The offset of the
290.Pa struct netmap_if
291in the mmapped region is indicated by the
292.Pa nr_offset
293field in the structure returned by
294.Pa NIOCREGIF .
295From there, all other objects are reachable through
296relative references (offsets or indexes).
297Macros and functions in <net/netmap_user.h>
help convert them into actual pointers:
299.Pp
300.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
301.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
302.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
303.Pp
304.Dl char *buf = NETMAP_BUF(ring, buffer_index);
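.Pp
The list of extra buffers described under
.Vt struct netmap_if
can be walked with a few lines of code.
The sketch below is a self-contained model, not the netmap API:
.Fn model_buf ,
.Fn model_count_extra_bufs
and
.Dv MODEL_BUF_SIZE
are hypothetical names, and the buffer pool is a plain array, whereas a real
program would obtain buffer addresses with NETMAP_BUF().
.Bd -literal
#include <stdint.h>
#include <string.h>

/* Assumed buffer size for this model only. */
#define MODEL_BUF_SIZE 2048

/* Address of buffer idx in a pool of back to back buffers. */
static char *
model_buf(char *base, uint32_t idx)
{
    return base + (size_t)idx * MODEL_BUF_SIZE;
}

/*
 * Count the buffers on a free list that starts at index head.
 * The first uint32_t of each buffer holds the index of the next
 * buffer; index 0 terminates the list.
 */
static uint32_t
model_count_extra_bufs(char *base, uint32_t head)
{
    uint32_t n = 0, next;

    while (head != 0) {
        memcpy(&next, model_buf(base, head), sizeof(next));
        head = next;
        n++;
    }
    return n;
}
.Ed
.Pp
With a list such as 1, 3, 2, 0 stored in the buffers, the function
visits buffers 1, 3 and 2 and stops at the terminating 0.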
305.Sh RINGS, BUFFERS AND DATA I/O
306.Va Rings
307are circular queues of packets with three indexes/pointers
308.Va ( head , cur , tail ) ;
309one slot is always kept empty.
310The ring size
311.Va ( num_slots )
312should not be assumed to be a power of two.
313.br
314(NOTE: older versions of netmap used head/count format to indicate
315the content of a ring).
316.Pp
317.Va head
318is the first slot available to userspace;
319.br
320.Va cur
321is the wakeup point:
322select/poll will unblock when
323.Va tail
324passes
325.Va cur ;
326.br
327.Va tail
328is the first slot reserved to the kernel.
329.Pp
330Slot indexes MUST only move forward;
331for convenience, the function
332.Dl nm_ring_next(ring, index)
333returns the next index modulo the ring size.
334.Pp
335.Va head
336and
337.Va cur
338are only modified by the user program;
339.Va tail
340is only modified by the kernel.
341The kernel only reads/writes the
342.Vt struct netmap_ring
343slots and buffers
344during the execution of a netmap-related system call.
345The only exception are slots (and buffers) in the range
346.Va tail\  . . . head-1 ,
347that are explicitly assigned to the kernel.
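.Pp
As a rough illustration of the index rules above, the following sketch
models the three indexes on a plain C structure.
It is not the netmap API:
.Vt struct model_ring ,
.Fn model_ring_next
and
.Fn model_ring_space
are hypothetical names used only for this example, and real programs
should rely on the helpers in
.In net/netmap_user.h .
.Bd -literal
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct model_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* Index following i; wraps at num_slots, which need not be 2^n. */
static uint32_t
model_ring_next(const struct model_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range cur ... tail-1. */
static uint32_t
model_ring_space(const struct model_ring *r)
{
    int64_t n = (int64_t)r->tail - (int64_t)r->cur;

    if (n < 0)
        n += r->num_slots;
    return (uint32_t)n;
}
.Ed
.Pp
For a ring of 10 slots with cur at 8 and tail at 3, the available
slots are 8, 9, 0, 1 and 2, so the space is 5; when cur equals tail
the space is 0, which is the condition under which select() and poll()
block.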
348.Pp
349.Ss TRANSMIT RINGS
350On transmit rings, after a
351.Nm
352system call, slots in the range
353.Va head\  . . . tail-1
354are available for transmission.
355User code should fill the slots sequentially
356and advance
357.Va head
358and
359.Va cur
360past slots ready to transmit.
361.Va cur
362may be moved further ahead if the user code needs
363more slots before further transmissions (see
364.Sx SCATTER GATHER I/O ) .
365.Pp
366At the next NIOCTXSYNC/select()/poll(),
367slots up to
368.Va head-1
369are pushed to the port, and
370.Va tail
371may advance if further slots have become available.
372Below is an example of the evolution of a TX ring:
373.Pp
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
393.Pp
select() and poll() will block if there is no space in the ring, i.e.
395.Dl ring->cur == ring->tail
396and return when new slots have become available.
397.Pp
398High speed applications may want to amortize the cost of system calls
399by preparing as many packets as possible before issuing them.
400.Pp
401A transmit ring with pending transmissions has
402.Dl ring->head != ring->tail + 1 (modulo the ring size).
403The function
404.Va int nm_tx_pending(ring)
405implements this test.
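.Pp
The pending test can be written out on a simplified ring.
The sketch below only restates the condition given above;
.Vt struct model_ring
and
.Fn model_tx_pending
are hypothetical names for this example and do not replace the helper
in
.In net/netmap_user.h .
.Bd -literal
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct model_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* Nonzero while head differs from tail + 1 (modulo the ring size). */
static int
model_tx_pending(const struct model_ring *r)
{
    uint32_t after_tail;

    after_tail = (r->tail + 1 == r->num_slots) ? 0 : r->tail + 1;
    return r->head != after_tail;
}
.Ed
.Pp
A sender that wants to make sure everything has left the ring before
closing the file descriptor could, for instance, repeat NIOCTXSYNC
until such a test reports that nothing is pending.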
406.Pp
407.Ss RECEIVE RINGS
408On receive rings, after a
409.Nm
410system call, the slots in the range
411.Va head\& . . . tail-1
412contain received packets.
413User code should process them and advance
414.Va head
415and
416.Va cur
417past slots it wants to return to the kernel.
418.Va cur
419may be moved further ahead if the user code wants to
420wait for more packets
421without returning all the previous slots to the kernel.
422.Pp
423At the next NIOCRXSYNC/select()/poll(),
424slots up to
425.Va head-1
426are returned to the kernel for further receives, and
427.Va tail
428may advance to report new incoming packets.
429.br
430Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
450.Pp
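The receive-side rules can be illustrated with a small model of a ring.
This is a sketch under stated assumptions, not the netmap API:
.Vt struct model_ring ,
.Vt struct model_slot
and
.Fn model_rx_consume
are hypothetical, and the slot array has a fixed size only to keep the
example self-contained.
.Bd -literal
#include <stdint.h>

struct model_slot {
    uint16_t len;
};

struct model_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    struct model_slot slot[8];  /* fixed size only for this sketch */
};

/*
 * Consume up to limit received slots, starting at head and stopping
 * at tail. Returns the total number of bytes seen and releases the
 * consumed slots by advancing head and cur together.
 */
static uint32_t
model_rx_consume(struct model_ring *r, uint32_t limit)
{
    uint32_t i = r->head, bytes = 0;

    while (i != r->tail && limit-- > 0) {
        bytes += r->slot[i].len;
        i = (i + 1 == r->num_slots) ? 0 : i + 1;
    }
    r->head = r->cur = i;
    return bytes;
}
.Ed
.Pp
In a real program the slots released this way are returned to the
kernel only at the next NIOCRXSYNC, select() or poll().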
451.Sh SLOTS AND PACKET BUFFERS
452Normally, packets should be stored in the netmap-allocated buffers
453assigned to slots when ports are bound to a file descriptor.
454One packet is fully contained in a single buffer.
455.Pp
456The following flags affect slot and buffer processing:
457.Bl -tag -width XXX
458.It NS_BUF_CHANGED
459it MUST be used when the buf_idx in the slot is changed.
460This can be used to implement
461zero-copy forwarding, see
462.Sx ZERO-COPY FORWARDING .
463.Pp
464.It NS_REPORT
465reports when this buffer has been transmitted.
466Normally,
467.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
471.It NS_FORWARD
472When a ring is in 'transparent' mode (see
473.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
475at the next system call, thus restoring (in a selective way)
476the connection between a NIC and the host stack.
477.It NS_NO_LEARN
478tells the forwarding code that the SRC MAC address for this
479packet must not be used in the learning bridge code.
480.It NS_INDIRECT
481indicates that the packet's payload is in a user-supplied buffer,
482whose user virtual address is in the 'ptr' field of the slot.
483The size can reach 65535 bytes.
484.br
485This is only supported on the transmit ring of
486.Nm VALE
ports, and it helps reduce data copies in the interconnection
488of virtual machines.
489.It NS_MOREFRAG
490indicates that the packet continues with subsequent buffers;
491the last buffer in a packet must have the flag clear.
492.El
493.Sh SCATTER GATHER I/O
494Packets can span multiple slots if the
495.Va NS_MOREFRAG
496flag is set in all but the last slot.
497The maximum length of a chain is 64 buffers.
498This is normally used with
499.Nm VALE
500ports when connecting virtual machines, as they generate large
501TSO segments that are not split unless they reach a physical device.
502.Pp
503NOTE: The length field always refers to the individual
504fragment; there is no place with the total length of a packet.
505.Pp
506On receive rings the macro
507.Va NS_RFRAGS(slot)
508indicates the remaining number of slots for this packet,
509including the current one.
510Slots with a value greater than 1 also have NS_MOREFRAG set.
511.Sh IOCTLS
512.Nm
513uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
514for non-blocking I/O. They take no argument.
515Two more ioctls (NIOCGINFO, NIOCREGIF) are used
516to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
536.Pp
537A file descriptor obtained through
538.Pa /dev/netmap
also supports the ioctls supported by network devices, see
540.Xr netintro 4 .
541.Pp
542.Bl -tag -width XXXX
543.It Dv NIOCGINFO
544returns EINVAL if the named port does not support netmap.
545Otherwise, it returns 0 and (advisory) information
546about the port.
547Note that all the information below can change before the
548interface is actually put in netmap mode.
549.Pp
550.Bl -tag -width XX
551.It Pa nr_memsize
552indicates the size of the
553.Nm
554memory region. NICs in
555.Nm
556mode all share the same memory region,
557whereas
558.Nm VALE
559ports have independent regions for each port.
560.It Pa nr_tx_slots , nr_rx_slots
561indicate the size of transmit and receive rings.
562.It Pa nr_tx_rings , nr_rx_rings
563indicate the number of transmit
564and receive rings.
565Both ring number and sizes may be configured at runtime
566using interface-specific functions (e.g.
567.Xr ethtool
568).
569.El
570.It Dv NIOCREGIF
571binds the port named in
572.Va nr_name
573to the file descriptor. For a physical device this also switches it into
574.Nm
575mode, disconnecting
576it from the host stack.
577Multiple file descriptors can be bound to the same port,
578with proper synchronization left to the user.
579.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
586.Pp
587To enable this function, the
588.Pa nr_arg1
589field of the structure can be used as a hint to the kernel to
590indicate how many pipes we expect to use, and reserve extra space
591in the memory region.
592.Pp
593On return, it gives the same info as NIOCGINFO,
594with
595.Pa nr_ringid
596and
597.Pa nr_flags
598indicating the identity of the rings controlled through the file
599descriptor.
600.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
604Possible values of
605.Pa nr_flags
606are indicated below, together with the naming schemes
607that application libraries (such as the
608.Nm nm_open
609indicated below) can use to indicate the specific set of rings.
610In the example below, "netmap:foo" is any valid netmap port name.
611.Pp
612.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
620only the i-th hardware ring pair, where the number is in
621.Pa nr_ringid ;
622.It NR_REG_PIPE_MASTER  "netmap:foo{i"
623the master side of the netmap pipe whose identifier (i) is in
624.Pa nr_ringid ;
625.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
626the slave side of the netmap pipe whose identifier (i) is in
627.Pa nr_ringid .
628.Pp
The identifier of a pipe must be thought of as part of the pipe name,
630and does not need to be sequential. On return the pipe
631will only have a single ring pair with index 0,
632irrespective of the value of i.
633.El
634.Pp
635By default, a
636.Xr poll 2
637or
638.Xr select 2
639call pushes out any pending packets on the transmit ring, even if
640no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is called, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or a full ring.
649.Pp
650When registering a virtual interface that is dynamically created to a
651.Xr vale 4
652switch, we can specify the desired number of rings (1 by default,
653and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
654.It Dv NIOCTXSYNC
655tells the hardware of new packets to transmit, and updates the
656number of slots available for transmission.
657.It Dv NIOCRXSYNC
658tells the hardware of consumed packets, and asks for newly available
659packets.
660.El
.Sh SELECT, POLL, EPOLL, KQUEUE
662.Xr select 2
663and
664.Xr poll 2
665on a
666.Nm
667file descriptor process rings as indicated in
668.Sx TRANSMIT RINGS
669and
670.Sx RECEIVE RINGS ,
671respectively when write (POLLOUT) and read (POLLIN) events are requested.
672Both block if no slots are available in the ring
673.Va ( ring->cur == ring->tail ) .
674Depending on the platform,
675.Xr epoll 2
676and
677.Xr kqueue 2
678are supported too.
679.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the NETMAP_NO_TX_POLL flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the NETMAP_DO_RX_POLL flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
690.Sh LIBRARIES
691The
692.Nm
693API is supposed to be used directly, both because of its simplicity and
694for efficient integration with applications.
695.Pp
For convenience, the
697.Va <net/netmap_user.h>
698header provides a few macros and functions to ease creating
699a file descriptor and doing I/O with a
700.Nm
701port. These are loosely modeled after the
702.Xr pcap 3
703API, to ease porting of libpcap-based applications to
704.Nm .
705To use these extra functions, programs should
706.Dl #define NETMAP_WITH_LIBS
707before
708.Dl #include <net/netmap_user.h>
709.Pp
710The following functions are available:
711.Bl -tag -width XXXXX
712.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
713similar to
714.Xr pcap_open ,
715binds a file descriptor to a port.
716.Bl -tag -width XX
717.It Va ifname
718is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
719.Nm VALE
720port.
721.It Va req
722provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
724ifname and flags, and other fields can be overridden through
725the other two arguments.
726.It Va arg
727points to a struct nm_desc containing arguments (e.g. from a previously
728open file descriptor) that should override the defaults.
The fields are used as described below.
730.It Va flags
731can be set to a combination of the following flags:
732.Va NETMAP_NO_TX_POLL ,
733.Va NETMAP_DO_RX_POLL
734(copied into nr_ringid);
735.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
736avoids the mmap and uses the values from it);
737.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
738.Va NM_OPEN_ARG1 ,
739.Va NM_OPEN_ARG2 ,
740.Va NM_OPEN_ARG3 (uses the fields from arg);
741.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
742.El
743.It Va int nm_close(struct nm_desc *d)
744closes the file descriptor, unmaps memory, frees resources.
745.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
746similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
748.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
749similar to pcap_dispatch(), applies a callback to incoming packets
750.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
751similar to pcap_next(), fetches the next packet
752.Pp
753.El
754.Sh SUPPORTED DEVICES
755.Nm
756natively supports the following devices:
757.Pp
758On FreeBSD:
759.Xr em 4 ,
760.Xr igb 4 ,
761.Xr ixgbe 4 ,
762.Xr lem 4 ,
763.Xr re 4 .
764.Pp
On Linux:
766.Xr e1000 4 ,
767.Xr e1000e 4 ,
768.Xr igb 4 ,
769.Xr ixgbe 4 ,
770.Xr mlx4 4 ,
771.Xr forcedeth 4 ,
772.Xr r8169 4 .
773.Pp
774NICs without native support can still be used in
775.Nm
776mode through emulation. Performance is inferior to native netmap
777mode but still significantly higher than sockets, and approaching
778that of in-kernel solutions such as Linux's
779.Xr pktgen .
780.Pp
781Emulation is also available for devices with native netmap support,
782which can be used for testing or performance comparison.
783The sysctl variable
784.Va dev.netmap.admode
785globally controls how netmap mode is implemented.
786.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
788.Nm
789are controlled through sysctl variables on FreeBSD
790.Em ( dev.netmap.* )
791and module parameters on Linux
792.Em ( /sys/module/netmap_lin/parameters/* ) :
793.Pp
794.Bl -tag -width indent
795.It Va dev.netmap.admode: 0
796Controls the use of native or emulated adapter mode.
7970 uses the best available option, 1 forces native and
798fails if not available, 2 forces emulated hence never fails.
799.It Va dev.netmap.generic_ringsize: 1024
800Ring size used for emulated netmap mode
801.It Va dev.netmap.generic_mit: 100000
802Controls interrupt moderation for emulated mode
803.It Va dev.netmap.mmap_unreg: 0
804.It Va dev.netmap.fwd: 0
805Forces NS_FORWARD mode
806.It Va dev.netmap.flags: 0
807.It Va dev.netmap.txsync_retry: 2
808.It Va dev.netmap.no_pendintr: 1
809Forces recovery of transmit buffers on system calls
810.It Va dev.netmap.mitigate: 1
811Propagates interrupt mitigation to user processes
812.It Va dev.netmap.no_timestamp: 0
813Disables the update of the timestamp in the netmap ring
814.It Va dev.netmap.verbose: 0
815Verbose kernel messages
816.It Va dev.netmap.buf_num: 163840
817.It Va dev.netmap.buf_size: 2048
818.It Va dev.netmap.ring_num: 200
819.It Va dev.netmap.ring_size: 36864
820.It Va dev.netmap.if_num: 100
821.It Va dev.netmap.if_size: 1024
822Sizes and number of objects (netmap_if, netmap_ring, buffers)
823for the global memory region. The only parameter worth modifying is
824.Va dev.netmap.buf_num
825as it impacts the total amount of memory used by netmap.
826.It Va dev.netmap.buf_curr_num: 0
827.It Va dev.netmap.buf_curr_size: 0
828.It Va dev.netmap.ring_curr_num: 0
829.It Va dev.netmap.ring_curr_size: 0
830.It Va dev.netmap.if_curr_num: 0
831.It Va dev.netmap.if_curr_size: 0
832Actual values in use.
833.It Va dev.netmap.bridge_batch: 1024
834Batch size used when moving packets across a
835.Nm VALE
836switch. Values above 64 generally guarantee good
837performance.
838.El
839.Sh SYSTEM CALLS
840.Nm
841uses
842.Xr select 2 ,
843.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
847to wake up processes when significant events occur, and
848.Xr mmap 2
849to map memory.
850.Xr ioctl 2
851is used to configure ports and
.Nm VALE
switches.
853.Pp
854Applications may need to create threads and bind them to
855specific cores to improve performance, using standard
856OS primitives, see
857.Xr pthread 3 .
858In particular,
859.Xr pthread_setaffinity_np 3
860may be of use.
861.Sh CAVEATS
862No matter how fast the CPU and OS are,
863achieving line rate on 10G and faster interfaces
864requires hardware with sufficient performance.
865Several NICs are unable to sustain line rate with
866small packet sizes. Insufficient PCIe or memory bandwidth
867can also cause reduced performance.
868.Pp
869Another frequent reason for low performance is the use
870of flow control on the link: a slow receiver can limit
871the transmit speed.
872Be sure to disable flow control when running high
873speed experiments.
874.Pp
875.Ss SPECIAL NIC FEATURES
876.Nm
877is orthogonal to some NIC features such as
878multiqueue, schedulers, packet filters.
879.Pp
880Multiple transmit and receive rings are supported natively
881and can be configured with ordinary OS tools,
882such as
883.Xr ethtool
884or
885device-specific sysctl variables.
886The same goes for Receive Packet Steering (RPS)
887and filtering of incoming traffic.
888.Pp
889.Nm
890.Em does not use
891features such as
892.Em checksum offloading , TCP segmentation offloading ,
893.Em encryption , VLAN encapsulation/decapsulation ,
etc.
895When using netmap to exchange packets with the host stack,
896make sure to disable these features.
897.Sh EXAMPLES
898.Ss TEST PROGRAMS
899.Nm
900comes with a few programs that can be used for testing or
901simple applications.
902See the
903.Va examples/
904directory in
905.Nm
906distributions, or
907.Va tools/tools/netmap/
908directory in FreeBSD distributions.
909.Pp
910.Xr pkt-gen
911is a general purpose traffic source/sink.
912.Pp
913As an example
914.Dl pkt-gen -i ix0 -f tx -l 60
915can generate an infinite stream of minimum size packets, and
916.Dl pkt-gen -i ix0 -f rx
917is a traffic sink.
918Both print traffic statistics, to help monitor
919how the system performs.
920.Pp
921.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
923rates, and use multiple send/receive threads and cores.
924.Pp
925.Xr bridge
926is another test program which interconnects two
927.Nm
928ports. It can be used for transparent forwarding between
929interfaces, as in
930.Dl bridge -i ix0 -i ix1
931or even connect the NIC to the host stack using netmap
932.Dl bridge -i ix0 -i ix0
933.Ss USING THE NATIVE API
The following code implements a traffic generator:
935.Pp
936.Bd -literal -compact
937#include <net/netmap_user.h>
938...
939void sender(void)
940{
941    struct netmap_if *nifp;
942    struct netmap_ring *ring;
943    struct nmreq nmr;
944    struct pollfd fds;
945
946    fd = open("/dev/netmap", O_RDWR);
947    bzero(&nmr, sizeof(nmr));
948    strcpy(nmr.nr_name, "ix0");
949    nmr.nm_version = NETMAP_API;
950    ioctl(fd, NIOCREGIF, &nmr);
951    p = mmap(0, nmr.nr_memsize, fd);
952    nifp = NETMAP_IF(p, nmr.nr_offset);
953    ring = NETMAP_TXRING(nifp, 0);
954    fds.fd = fd;
955    fds.events = POLLOUT;
956    for (;;) {
957	poll(&fds, 1, -1);
958	while (!nm_ring_empty(ring)) {
959	    i = ring->cur;
960	    buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
961	    ... prepare packet in buf ...
962	    ring->slot[i].len = ... packet length ...
963	    ring->head = ring->cur = nm_ring_next(ring, i);
964	}
965    }
966}
967.Ed
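.Pp
The transmit loop above depends on circular index arithmetic:
the application consumes slots starting at
.Va cur
until it reaches
.Va tail ,
wrapping at the end of the ring.
The following is a minimal sketch of that arithmetic which compiles
without the
.Nm
headers.
The type
.Vt struct model_ring
and the
.Fn model_*
functions are hypothetical stand-ins introduced only for illustration;
they assume that slots from
.Va cur
up to, but excluding,
.Va tail
are available to the application.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of a netmap ring. */
struct model_ring {
    uint32_t num_slots; /* total number of slots in the ring */
    uint32_t head;      /* first slot still owned by the application */
    uint32_t cur;       /* next slot the application will look at */
    uint32_t tail;      /* first slot not available to the application */
};

/* Successor of index i, wrapping to 0 at the end of the ring. */
static uint32_t
model_ring_next(const struct model_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots available between cur (included) and tail (excluded). */
static uint32_t
model_ring_space(const struct model_ring *r)
{
    return (r->tail >= r->cur) ? r->tail - r->cur
        : r->num_slots - r->cur + r->tail;
}

/*
 * Consume up to limit slots the way the loop in sender() does,
 * advancing head and cur together.  Returns the number consumed.
 */
static uint32_t
model_fill(struct model_ring *r, uint32_t limit)
{
    uint32_t n = 0;

    while (r->cur != r->tail && n < limit) {
        /* a real sender would prepare the packet in the slot here */
        r->head = r->cur = model_ring_next(r, r->cur);
        n++;
    }
    return n;
}
```
.Ed
With a four-slot ring whose
.Va cur
is 2 and whose
.Va tail
is 1, three slots (2, 3 and 0) are available, and consuming them
leaves
.Va cur
equal to
.Va tail .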
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);              /* wait for incoming packets */
        while ((buf = nm_nextpkt(d, &h)) != NULL)
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to forward packets between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
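.Pp
The fragment above moves a single packet.
A forwarding loop would typically repeat the swap for as many slots
as both rings have available.
The following sketch models such a loop on plain arrays so that it
compiles without the
.Nm
headers.
The types
.Vt struct fwd_slot
and
.Vt struct fwd_ring ,
the constant
.Dv FWD_BUF_CHANGED
and the
.Fn fwd_*
functions are hypothetical stand-ins for the real structures and for
.Dv NS_BUF_CHANGED ;
this is an illustration of the buffer-swap idea, not the code used by
the test programs.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FWD_BUF_CHANGED 0x0001  /* hypothetical stand-in for NS_BUF_CHANGED */

/* Hypothetical stand-in for a netmap slot. */
struct fwd_slot {
    uint32_t buf_idx;   /* index of the packet buffer */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* per-slot flags */
};

/* Hypothetical stand-in for a netmap ring. */
struct fwd_ring {
    uint32_t num_slots;
    uint32_t head;
    uint32_t cur;
    uint32_t tail;
    struct fwd_slot *slot;
};

/* Successor of index i, wrapping at the end of the ring. */
static uint32_t
fwd_next(const struct fwd_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots available between cur and tail. */
static uint32_t
fwd_space(const struct fwd_ring *r)
{
    return (r->tail >= r->cur) ? r->tail - r->cur
        : r->num_slots - r->cur + r->tail;
}

/*
 * Move up to limit packets from rxr to txr by swapping buffer
 * indices instead of copying payloads.  Returns the number moved.
 */
static uint32_t
fwd_move(struct fwd_ring *rxr, struct fwd_ring *txr, uint32_t limit)
{
    uint32_t n = fwd_space(rxr), m = fwd_space(txr), done;

    assert(rxr->slot != NULL && txr->slot != NULL);
    if (n > limit)
        n = limit;
    if (n > m)
        n = m;
    for (done = 0; done < n; done++) {
        struct fwd_slot *src = &rxr->slot[rxr->cur];
        struct fwd_slot *dst = &txr->slot[txr->cur];
        uint32_t tmp = dst->buf_idx;

        dst->buf_idx = src->buf_idx;    /* hand the full buffer to TX */
        dst->len = src->len;
        dst->flags |= FWD_BUF_CHANGED;
        src->buf_idx = tmp;             /* give RX the spare buffer */
        src->flags |= FWD_BUF_CHANGED;
        rxr->head = rxr->cur = fwd_next(rxr, rxr->cur);
        txr->head = txr->cur = fwd_next(txr, txr->cur);
    }
    return done;
}
```
.Ed
The number of packets moved is bounded by the smaller of the two
rings' available space, so the loop never overruns either ring.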
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).