.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd February 13, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.br
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm , VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
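.Pp
The 14.88 Mpps figure is the theoretical line rate of a 10 Gbit/s
Ethernet link carrying minimum-size frames.
The sketch below reproduces the arithmetic, assuming standard Ethernet
framing (a 64-byte frame including CRC, plus 20 bytes of preamble and
inter-frame gap on the wire).
The name
.Fn line_rate_pps
is made up for this example and is not part of the
.Nm
API.
.Bd -literal
```c
#include <stdint.h>

/*
 * Bytes that occupy the wire for every frame without being part of it:
 * 8 bytes of preamble/SFD plus a 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Packets per second at line rate for a link speed in bit/s and a
 * frame size in bytes (CRC included). Hypothetical helper.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_on_wire;

	bits_on_wire = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8;
	return (link_bps / bits_on_wire);
}
```
.Ed
.Pp
With a 10 Gbit/s link and 64-byte frames each packet takes 672 bit
times, which gives 14880952 packets per second.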
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A 0 indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
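.Pp
The list of extra buffers headed by
.Pa ni_bufs_head
can be walked as sketched below.
To stay self-contained, the example uses a flat byte array in place of
the mmapped region and assumes a fixed 2048-byte buffer size; real code
would locate each buffer with NETMAP_BUF() and take the size from
.Pa nr_buf_size .
All
.Li demo_
names are hypothetical.
.Bd -literal
```c
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_SIZE 2048u	/* assumed buffer size for this sketch */

/* Read the "next" index stored in the first 4 bytes of buffer idx. */
static uint32_t
demo_buf_next(const unsigned char *pool, uint32_t idx)
{
	uint32_t next;

	memcpy(&next, pool + (size_t)idx * DEMO_BUF_SIZE, sizeof(next));
	return (next);
}

/* Count the buffers on the list starting at head; index 0 ends it. */
static uint32_t
demo_count_extra_bufs(const unsigned char *pool, uint32_t head)
{
	uint32_t n = 0;

	for (uint32_t idx = head; idx != 0; idx = demo_buf_next(pool, idx))
		n++;
	return (n);
}
```
.Ed
.Pp
Because index 0 terminates the list, buffer 0 can never appear on it.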
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
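.Pp
The index arithmetic described above can be sketched as follows.
The structure and the
.Li demo_
functions are stand-ins written for this example, modelled on the
.Va head , cur , tail
fields; they are not the definitions from
.In net/netmap_user.h .
.Bd -literal
```c
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct demo_ring {
	uint32_t num_slots;	/* slots in the ring */
	uint32_t head;		/* first slot owned by the user */
	uint32_t cur;		/* wakeup position */
	uint32_t tail;		/* first slot owned by the kernel */
};

/* Next index, wrapping without assuming a power-of-two ring size. */
static uint32_t
demo_ring_next(const struct demo_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots ? 0 : i + 1);
}

/* Number of slots in the range cur ... tail-1. */
static uint32_t
demo_ring_space(const struct demo_ring *r)
{
	if (r->tail >= r->cur)
		return (r->tail - r->cur);
	return (r->tail + r->num_slots - r->cur);
}
```
.Ed
.Pp
The explicit comparison in
.Fn demo_ring_next
is used instead of a bit mask because the ring size is not guaranteed
to be a power of two.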
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
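.Pp
As an illustration of the transmit rules, the sketch below queues a
batch of packets on a mock ring and applies the pending test quoted
above.
The structure and all
.Li demo_
names are hypothetical; the mock ring holds at most 8 slots and only
records packet lengths.
.Bd -literal
```c
#include <stdint.h>

/* Mock transmit ring; num_slots must not exceed 8 in this sketch. */
struct demo_txring {
	uint32_t num_slots, head, cur, tail;
	uint16_t len[8];	/* stand-in for slot[i].len */
};

static uint32_t
demo_next(const struct demo_txring *r, uint32_t i)
{
	return (i + 1 == r->num_slots ? 0 : i + 1);
}

/* Queue up to want packets of len bytes; return how many were queued. */
static uint32_t
demo_tx_fill(struct demo_txring *r, uint32_t want, uint16_t len)
{
	uint32_t n = 0;

	while (n < want && r->cur != r->tail) {	/* stop when ring is full */
		r->len[r->cur] = len;
		r->head = r->cur = demo_next(r, r->cur);
		n++;
	}
	return (n);
}

/* Nonzero while head != tail + 1 (modulo the ring size). */
static int
demo_tx_pending(const struct demo_txring *r)
{
	return (demo_next(r, r->tail) != r->head);
}
```
.Ed
.Pp
Filling as many slots as possible before the next system call is what
amortizes the cost of that call over the whole batch.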
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
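.Pp
The receive side can be sketched in the same way.
The example below walks a mock ring from
.Va head
to
.Va tail ,
sums the packet lengths and then releases every slot by advancing
.Va head
and
.Va cur .
The structure and function are hypothetical and hold at most 8 slots.
.Bd -literal
```c
#include <stdint.h>

/* Mock receive ring; num_slots must not exceed 8 in this sketch. */
struct demo_rxring {
	uint32_t num_slots, head, cur, tail;
	uint16_t len[8];	/* stand-in for slot[i].len */
};

/* Consume every slot in head ... tail-1; return the bytes seen. */
static uint64_t
demo_rx_drain(struct demo_rxring *r)
{
	uint64_t bytes = 0;
	uint32_t i = r->head;

	while (i != r->tail) {
		bytes += r->len[i];
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
	}
	r->head = r->cur = i;	/* returned to the kernel at the next sync */
	return (bytes);
}
```
.Ed
.Pp
A program that wants to hold on to some packets would advance
.Va head
only past the slots it has finished with.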
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
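.Pp
Since no field carries the total packet length, a receiver has to add
up the fragments itself.
The sketch below does this on a mock slot array.
The structure, the function and the value given to
.Li DEMO_NS_MOREFRAG
are local to this example; real code should use the definitions in
.In net/netmap.h .
.Bd -literal
```c
#include <stdint.h>

#define DEMO_NS_MOREFRAG 0x0020	/* local stand-in for NS_MOREFRAG */

/* Mock slot with only the fields used here. */
struct demo_slot {
	uint16_t len;
	uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot first.
 * avail is the number of slots that may be examined. On success the
 * number of slots used is stored in *nfrags. Returns -1 if the chain
 * does not end within avail slots.
 */
static int
demo_pkt_len(const struct demo_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nfrags)
{
	int total = 0;
	uint32_t i = first;

	for (uint32_t n = 1; n <= avail; n++) {
		total += slot[i].len;
		if ((slot[i].flags & DEMO_NS_MOREFRAG) == 0) {
			*nfrags = n;	/* last fragment reached */
			return (total);
		}
		i = (i + 1 == num_slots) ? 0 : i + 1;
	}
	return (-1);
}
```
.Ed
.Pp
A caller would typically pass the number of slots between
.Va head
and
.Va tail
as
.Fa avail ,
and wait for more slots when -1 is returned.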
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space of the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as
.Fn nm_open
described below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of i.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only on
.Va ioctl(NIOCTXSYNC)
or when select()/poll() is called with a write event (POLLOUT/wfdset)
or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
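.Pp
The port naming scheme listed under
.Dv NIOCREGIF
("netmap:foo", "netmap:foo^", "netmap:foo+", "netmap:foo-i",
"netmap:foo{i", "netmap:foo}i") can be decoded as sketched below.
This is a simplified illustration, not the parser used by
.Fn nm_open :
the enum, the function name and the rule that the port name ends at
the first suffix character are assumptions of this example, so
interface names that themselves contain such a character are not
handled.
.Bd -literal
```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes, one per naming scheme. */
enum demo_mode {
	DEMO_ALL_NIC, DEMO_SW, DEMO_NIC_SW,
	DEMO_ONE_NIC, DEMO_PIPE_MASTER, DEMO_PIPE_SLAVE, DEMO_BAD
};

/*
 * Classify a "netmap:<port><suffix>" name and store the numeric
 * ring or pipe identifier, if any, in *id.
 */
static enum demo_mode
demo_parse_port(const char *name, unsigned *id)
{
	static const char prefix[] = "netmap:";
	const size_t plen = sizeof(prefix) - 1;
	const char *port, *p;

	*id = 0;
	if (strncmp(name, prefix, plen) != 0)
		return (DEMO_BAD);
	port = name + plen;
	p = port + strcspn(port, "^+-{}");	/* end of the port name */
	if (p == port)
		return (DEMO_BAD);		/* empty port name */
	if (*p == 0)
		return (DEMO_ALL_NIC);
	if (*p == '^' || *p == '+') {
		if (p[1] != 0)
			return (DEMO_BAD);
		return (*p == '^' ? DEMO_SW : DEMO_NIC_SW);
	}
	if (!isdigit((unsigned char)p[1]))
		return (DEMO_BAD);		/* identifier is required */
	*id = (unsigned)strtoul(p + 1, NULL, 10);
	if (*p == '-')
		return (DEMO_ONE_NIC);
	return (*p == '{' ? DEMO_PIPE_MASTER : DEMO_PIPE_SLAVE);
}
```
.Ed
.Pp
The returned mode would map to a value for
.Pa nr_flags
and the identifier to
.Pa nr_ringid .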
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the NETMAP_NO_TX_POLL flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the NETMAP_DO_RX_POLL flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
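.Pp
To see why
.Va dev.netmap.buf_num
dominates memory use, the requested sizes can be multiplied out as in
the sketch below.
It assumes that each pool takes the number of objects times the object
size, and ignores any rounding or padding done by the allocator, so
the result is only an estimate; the values actually in use are the
ones reported by the
.Va *_curr_*
variables.
Both function names are made up for this example.
.Bd -literal
```c
#include <stdint.h>

/* Bytes requested for one object pool: count times object size. */
static uint64_t
pool_bytes(uint64_t num, uint64_t size)
{
	return (num * size);
}

/* Sum over the three pools (netmap_if, netmap_ring, buffers). */
static uint64_t
region_bytes(uint64_t if_num, uint64_t if_size,
    uint64_t ring_num, uint64_t ring_size,
    uint64_t buf_num, uint64_t buf_size)
{
	return (pool_bytes(if_num, if_size) +
	    pool_bytes(ring_num, ring_size) +
	    pool_bytes(buf_num, buf_size));
}
```
.Ed
.Pp
With the defaults listed above the buffer pool alone accounts for
163840 * 2048 = 335544320 bytes (320 MiB), against about 7 MiB for
rings and 100 KiB for interface descriptors.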
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>		/* IFNAMSIZ */
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *p, *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);		/* bind the port */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
	MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
	poll(&fds, 1, -1);		/* wait for free tx slots */
	while (!nm_ring_empty(ring)) {
	    i = ring->cur;
	    buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
	    ... prepare packet in buf ...
	    ring->slot[i].len = ... packet length ...
	    ring->head = ring->cur = nm_ring_next(ring, i);
	}
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
	poll(&fds, 1, -1);
	while ( (buf = nm_nextpkt(d, &h)) )
	    consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports by
swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g. with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.