1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
30.Dd March 2, 2017
31.Dt NETMAP 4
32.Os
33.Sh NAME
34.Nm netmap
35.Nd a framework for fast packet I/O
36.Nm VALE
37.Nd a fast VirtuAl Local Ethernet using the netmap API
38.Pp
39.Nm netmap pipes
40.Nd a shared memory packet transport channel
41.Sh SYNOPSIS
42.Cd device netmap
43.Sh DESCRIPTION
44.Nm
45is a framework for extremely fast and efficient packet I/O
46for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
50.Nm netmap ports ,
51including
52.Bl -tag -width XXXX
53.It Nm physical NIC ports
54to access individual queues of network interfaces;
55.It Nm host ports
56to inject packets into the host stack;
57.It Nm VALE ports
58implementing a very fast and modular in-kernel software switch/dataplane;
59.It Nm netmap pipes
60a shared memory packet transport channel;
61.It Nm netmap monitors
62a mechanism similar to
63.Xr bpf 4
to capture traffic.
65.El
66.Pp
67All these
68.Nm netmap ports
69are accessed interchangeably with the same API,
70and are at least one order of magnitude faster than
71standard OS mechanisms
72(sockets, bpf, tun/tap interfaces, native switches, pipes).
73With suitably fast hardware (NICs, PCIe buses, CPUs),
74packet I/O using
75.Nm
76on supported NICs
77reaches 14.88 million packets per second (Mpps)
78with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
83NICs without native
84.Nm
85support can still use the API in emulated mode,
86which uses unmodified device drivers and is 3-5 times faster than
87.Xr bpf 4
88or raw sockets.
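.Pp
The 14.88 Mpps figure is the line rate of 10 Gbit/s Ethernet
with minimum-size frames.
The sketch below reproduces that arithmetic, assuming 64-byte frames
plus 20 bytes of per-frame wire overhead
(preamble, start-of-frame delimiter and inter-frame gap).
The helper name
.Fn line_rate_pps
is hypothetical and not part of the
.Nm
API.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire, in bytes: 7-byte preamble, 1-byte
 * start-of-frame delimiter and 12-byte minimum inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Maximum frames per second on a link of link_bps bits per second
 * when every frame is frame_bytes long (FCS included).
 * Hypothetical helper, not part of the netmap API.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame;

    assert(frame_bytes > 0);
    bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD) * 8;
    return (link_bps / bits_per_frame);
}
```
.Ed
.Pp
With these assumptions, line_rate_pps(10000000000ULL, 64) yields
14880952 frames per second, i.e., 14.88 Mpps.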
89.Pp
90Userspace clients can dynamically switch NICs into
91.Nm
92mode and send and receive raw packets through
93memory mapped buffers.
94Similarly,
95.Nm VALE
96switch instances and ports,
97.Nm netmap pipes
98and
99.Nm netmap monitors
100can be created dynamically,
101providing high speed packet I/O between processes,
102virtual machines, NICs and the host stack.
103.Pp
104.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
108and standard OS mechanisms such as
109.Xr select 2 ,
110.Xr poll 2 ,
111.Xr epoll 2 ,
112and
113.Xr kqueue 2 .
114All types of
115.Nm netmap ports
116and the
117.Nm VALE switch
118are implemented by a single kernel module, which also emulates the
119.Nm
120API over standard drivers.
121For best performance,
122.Nm
123requires native support in device drivers.
124A list of such devices is at the end of this document.
125.Pp
126In the rest of this (long) manual page we document
127various aspects of the
128.Nm
129and
130.Nm VALE
131architecture, features and usage.
132.Sh ARCHITECTURE
133.Nm
134supports raw packet I/O through a
135.Em port ,
136which can be connected to a physical interface
137.Em ( NIC ) ,
138to the host stack,
139or to a
140.Nm VALE
141switch.
142Ports use preallocated circular queues of buffers
143.Em ( rings )
144residing in an mmapped region.
145There is one ring for each transmit/receive queue of a
146NIC or virtual port.
147An additional ring pair connects to the host stack.
148.Pp
149After binding a file descriptor to a port, a
150.Nm
151client can send or receive packets in batches through
152the rings, and possibly implement zero-copy forwarding
153between ports.
154.Pp
155All NICs operating in
156.Nm
157mode use the same memory region,
158accessible to all processes who own
159.Pa /dev/netmap
160file descriptors bound to NICs.
161Independent
162.Nm VALE
163and
164.Nm netmap pipe
165ports
166by default use separate memory regions,
167but can be independently configured to share memory.
168.Sh ENTERING AND EXITING NETMAP MODE
169The following section describes the system calls to create
170and control
171.Nm netmap
172ports (including
173.Nm VALE
174and
175.Nm netmap pipe
176ports).
177Simpler, higher level functions are described in the
178.Sx LIBRARIES
179section.
180.Pp
181Ports and rings are created and controlled through a file descriptor,
182created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
184and then bound to a specific port with an
185.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
186.Pp
187.Nm
188has multiple modes of operation controlled by the
189.Vt struct nmreq
190argument.
191.Va arg.nr_name
192specifies the netmap port name, as follows:
193.Bl -tag -width XXXX
194.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
195the data path of the NIC is disconnected from the host stack,
196and the file descriptor is bound to the NIC (one or all queues),
197or to the host stack;
198.It Dv valeSSS:PPP
199the file descriptor is bound to port PPP of VALE switch SSS.
200Switch instances and ports are dynamically created if necessary.
201.Pp
202Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
203cannot exceed IFNAMSIZ characters, and PPP cannot
204be the name of any existing OS network interface.
205.El
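.Pp
The naming rule for
.Nm VALE
ports can be checked with a few lines of code.
The following is a minimal sketch, not the kernel's parser:
.Fn vale_name_ok
and
.Dv DEMO_IFNAMSIZ
are hypothetical names, the value 16 is an assumed stand-in for
.Dv IFNAMSIZ ,
and the length limit is read as including the terminating NUL.
.Bd -literal
```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-in for IFNAMSIZ from <net/if.h>; 16 is an assumption. */
#define DEMO_IFNAMSIZ 16

/* Length of the leading run of [0-9a-zA-Z_] characters in s. */
static size_t
name_span(const char *s)
{
    size_t n = 0;

    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return (n);
}

/*
 * Return 1 if name has the form valeSSS:PPP, else 0.
 * Hypothetical helper; it does not check that PPP differs from the
 * names of existing OS network interfaces.
 */
static int
vale_name_ok(const char *name)
{
    size_t sw, port;

    assert(name != NULL);
    if (strlen(name) >= DEMO_IFNAMSIZ)  /* leave room for the NUL */
        return (0);
    if (strncmp(name, "vale", 4) != 0)
        return (0);
    sw = name_span(name + 4);           /* SSS */
    if (sw == 0 || name[4 + sw] != ':')
        return (0);
    port = name_span(name + 4 + sw + 1); /* PPP */
    return (port > 0 && name[4 + sw + 1 + port] == 0);
}
```
.Ed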
206.Pp
207On return,
208.Va arg
209indicates the size of the shared memory region,
210and the number, size and location of all the
211.Nm
212data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
214.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
Support for
.Xr epoll 2
and
.Xr kqueue 2
on
.Nm
file descriptors depends on the platform.
227.Pp
228While a NIC is in
229.Nm
230mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
232.Nm
233ring, and another ring is used to send packets into the OS network stack.
234A
235.Xr close 2
236on the file descriptor removes the binding,
237and returns the NIC to normal mode (reconnecting the data path
238to the host stack), or destroys the virtual port.
239.Sh DATA STRUCTURES
240The data structures in the mmapped memory region are detailed in
241.In sys/net/netmap.h ,
242which is the ultimate reference for the
243.Nm
244API.
245The main structures and fields are indicated below:
246.Bl -tag -width XXX
247.It Dv struct netmap_if (one per interface)
248.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
258.Ed
259.Pp
260Indicates the number of available rings
.Pa ( struct netmap_ring )
262and their position in the mmapped region.
263The number of tx and rx rings
264.Pa ( ni_tx_rings , ni_rx_rings )
265normally depends on the hardware.
266NICs also have an extra tx/rx ring pair connected to the host stack.
267.Em NIOCREGIF
268can also request additional unbound buffers in the same memory space,
269to be used as temporary storage for packets.
270.Pa ni_bufs_head
contains the index of the first of these free buffers,
272which are connected in a list (the first uint32_t of each
273buffer being the index of the next buffer in the list).
274A
275.Dv 0
276indicates the end of the list.
277.It Dv struct netmap_ring (one per ring)
278.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
293.Ed
294.Pp
295Implements transmit and receive rings, with read/write
296pointers, metadata and an array of
297.Em slots
298describing the buffers.
299.It Dv struct netmap_slot (one per buffer)
300.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
307.Ed
308.Pp
309Describes a packet buffer, which normally is identified by
310an index and resides in the mmapped region.
311.It Dv packet buffers
312Fixed size (normally 2 KB) packet buffers allocated by the kernel.
313.El
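.Pp
The list of extra buffers rooted at
.Pa ni_bufs_head
can be walked as sketched below.
This is an illustration under stated assumptions, not code from the
.Nm
headers:
.Va pool
stands in for the buffer area of the mmapped region,
.Dv DEMO_BUF_SIZE
for the buffer size, and both function names are hypothetical.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_SIZE 2048u  /* the "normally 2 KB" buffer size */

/*
 * Count the buffers on a free list in which the first uint32_t of
 * each buffer is the index of the next one and index 0 ends the list.
 * pool and head are illustrative stand-ins for the mmapped buffer
 * area and ni_bufs_head.
 */
static uint32_t
count_extra_bufs(const unsigned char *pool, uint32_t head)
{
    uint32_t n = 0, idx = head;

    assert(pool != NULL);
    while (idx != 0) {
        /* memcpy avoids unaligned or aliasing-unsafe loads */
        memcpy(&idx, pool + (size_t)idx * DEMO_BUF_SIZE, sizeof(idx));
        n++;
    }
    return (n);
}

/* Link buffer idx to next by writing next into its first 4 bytes. */
static void
link_buf(unsigned char *pool, uint32_t idx, uint32_t next)
{
    memcpy(pool + (size_t)idx * DEMO_BUF_SIZE, &next, sizeof(next));
}
```
.Ed
.Pp
Because index 0 terminates the list, buffer 0 can never appear
as a list element in this scheme.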
314.Pp
315The offset of the
316.Pa struct netmap_if
317in the mmapped region is indicated by the
318.Pa nr_offset
319field in the structure returned by
320.Dv NIOCREGIF .
321From there, all other objects are reachable through
322relative references (offsets or indexes).
323Macros and functions in
324.In net/netmap_user.h
help convert them into actual pointers:
326.Pp
327.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
328.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
329.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
330.Pp
331.Dl char *buf = NETMAP_BUF(ring, buffer_index);
332.Sh RINGS, BUFFERS AND DATA I/O
333.Va Rings
334are circular queues of packets with three indexes/pointers
335.Va ( head , cur , tail ) ;
336one slot is always kept empty.
337The ring size
338.Va ( num_slots )
339should not be assumed to be a power of two.
340.Pp
341.Va head
342is the first slot available to userspace;
343.Pp
344.Va cur
345is the wakeup point:
346select/poll will unblock when
347.Va tail
348passes
349.Va cur ;
350.Pp
351.Va tail
352is the first slot reserved to the kernel.
353.Pp
354Slot indexes
355.Em must
356only move forward;
357for convenience, the function
358.Dl nm_ring_next(ring, index)
359returns the next index modulo the ring size.
360.Pp
361.Va head
362and
363.Va cur
364are only modified by the user program;
365.Va tail
366is only modified by the kernel.
367The kernel only reads/writes the
368.Vt struct netmap_ring
369slots and buffers
370during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
372.Va tail\  . . . head-1 ,
373that are explicitly assigned to the kernel.
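.Pp
The index arithmetic described above can be sketched as follows.
.Vt struct demo_ring
is a minimal stand-in that carries only the index fields of
.Vt struct netmap_ring ,
and the two helpers are hypothetical; real programs should use
.Fn nm_ring_next
and the other helpers from the
.Nm
headers.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the index fields of struct netmap_ring. */
struct demo_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* Next index after i, wrapping at the ring size.  No masking is
 * used because num_slots need not be a power of two. */
static uint32_t
demo_ring_next(const struct demo_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots ? 0 : i + 1);
}

/* Number of slots in the user-owned range head ... tail-1. */
static uint32_t
demo_ring_user_slots(const struct demo_ring *r)
{
    return (r->tail >= r->head ?
        r->tail - r->head :
        r->tail + r->num_slots - r->head);
}
```
.Ed
.Pp
For a 10-slot ring with head at 8 and tail at 3, the user owns
slots 8, 9, 0, 1 and 2, so demo_ring_user_slots() returns 5.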
374.Pp
375.Ss TRANSMIT RINGS
376On transmit rings, after a
377.Nm
378system call, slots in the range
379.Va head\  . . . tail-1
380are available for transmission.
381User code should fill the slots sequentially
382and advance
383.Va head
384and
385.Va cur
386past slots ready to transmit.
387.Va cur
388may be moved further ahead if the user code needs
389more slots before further transmissions (see
390.Sx SCATTER GATHER I/O ) .
391.Pp
392At the next NIOCTXSYNC/select()/poll(),
393slots up to
394.Va head-1
395are pushed to the port, and
396.Va tail
397may advance if further slots have become available.
398Below is an example of the evolution of a TX ring:
399.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
417.Ed
418.Pp
419.Fn select
420and
421.Fn poll
422will block if there is no space in the ring, i.e.,
423.Dl ring->cur == ring->tail
424and return when new slots have become available.
425.Pp
426High speed applications may want to amortize the cost of system calls
427by preparing as many packets as possible before issuing them.
428.Pp
429A transmit ring with pending transmissions has
430.Dl ring->head != ring->tail + 1 (modulo the ring size).
431The function
432.Va int nm_tx_pending(ring)
433implements this test.
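.Pp
Since one slot is always kept empty, an idle transmit ring has
.Va head
exactly one slot past
.Va tail .
A minimal sketch of the resulting test follows;
.Fn demo_tx_pending
is a hypothetical stand-in for
.Fn nm_tx_pending
that takes plain indexes instead of a ring pointer.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/*
 * A transmit ring has pending transmissions unless head sits one
 * slot past tail (with wraparound).  Hypothetical helper operating
 * on bare indexes rather than on a struct netmap_ring.
 */
static int
demo_tx_pending(uint32_t head, uint32_t tail, uint32_t num_slots)
{
    uint32_t after_tail;

    assert(num_slots > 1 && head < num_slots && tail < num_slots);
    after_tail = (tail + 1 == num_slots) ? 0 : tail + 1;
    return (head != after_tail);
}
```
.Ed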
434.Ss RECEIVE RINGS
435On receive rings, after a
436.Nm
437system call, the slots in the range
438.Va head\& . . . tail-1
439contain received packets.
440User code should process them and advance
441.Va head
442and
443.Va cur
444past slots it wants to return to the kernel.
445.Va cur
446may be moved further ahead if the user code wants to
447wait for more packets
448without returning all the previous slots to the kernel.
449.Pp
450At the next NIOCRXSYNC/select()/poll(),
451slots up to
452.Va head-1
453are returned to the kernel for further receives, and
454.Va tail
455may advance to report new incoming packets.
456.Pp
457Below is an example of the evolution of an RX ring:
458.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
476.Ed
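.Pp
The simplest receive pattern, which consumes every received slot and
returns all of them to the kernel, can be sketched as follows.
The structures are reduced stand-ins for
.Vt struct netmap_ring
and
.Vt struct netmap_slot ,
the ring size of 8 is arbitrary, and
.Fn demo_rx_drain
is a hypothetical name.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 8u   /* arbitrary ring size for the sketch */

struct demo_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

struct demo_rx_ring {
    uint32_t num_slots, head, cur, tail;
    struct demo_slot slot[DEMO_SLOTS];
};

/*
 * Consume every received slot (head ... tail-1), add the packet
 * lengths to *bytes, and return the slots to the kernel by advancing
 * head and cur together.  Returns the number of slots consumed.
 */
static uint32_t
demo_rx_drain(struct demo_rx_ring *r, uint64_t *bytes)
{
    uint32_t i = r->head, n = 0;

    assert(r->tail < r->num_slots && i < r->num_slots);
    while (i != r->tail) {
        *bytes += r->slot[i].len;
        i = (i + 1 == r->num_slots) ? 0 : i + 1;
        n++;
    }
    r->head = r->cur = i;
    return (n);
}
```
.Ed
.Pp
A program that wants to hold some slots would instead stop advancing
.Va head
early and move only
.Va cur
forward.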
477.Sh SLOTS AND PACKET BUFFERS
478Normally, packets should be stored in the netmap-allocated buffers
479assigned to slots when ports are bound to a file descriptor.
480One packet is fully contained in a single buffer.
481.Pp
482The following flags affect slot and buffer processing:
483.Bl -tag -width XXX
484.It NS_BUF_CHANGED
485.Em must
486be used when the
487.Va buf_idx
488in the slot is changed.
489This can be used to implement
490zero-copy forwarding, see
491.Sx ZERO-COPY FORWARDING .
492.It NS_REPORT
493reports when this buffer has been transmitted.
494Normally,
495.Nm
496notifies transmit completions in batches, hence signals
497can be delayed indefinitely.
498This flag helps detect
499when packets have been sent and a file descriptor can be closed.
500.It NS_FORWARD
501When a ring is in 'transparent' mode (see
502.Sx TRANSPARENT MODE ) ,
503packets marked with this flag are forwarded to the other endpoint
504at the next system call, thus restoring (in a selective way)
505the connection between a NIC and the host stack.
506.It NS_NO_LEARN
507tells the forwarding code that the source MAC address for this
508packet must not be used in the learning bridge code.
509.It NS_INDIRECT
510indicates that the packet's payload is in a user-supplied buffer
511whose user virtual address is in the 'ptr' field of the slot.
512The size can reach 65535 bytes.
513.Pp
514This is only supported on the transmit ring of
515.Nm VALE
ports, and it helps reduce data copies in the interconnection
517of virtual machines.
518.It NS_MOREFRAG
519indicates that the packet continues with subsequent buffers;
520the last buffer in a packet must have the flag clear.
521.El
522.Sh SCATTER GATHER I/O
523Packets can span multiple slots if the
524.Va NS_MOREFRAG
525flag is set in all but the last slot.
526The maximum length of a chain is 64 buffers.
527This is normally used with
528.Nm VALE
529ports when connecting virtual machines, as they generate large
530TSO segments that are not split unless they reach a physical device.
531.Pp
532NOTE: The length field always refers to the individual
533fragment; there is no place with the total length of a packet.
534.Pp
535On receive rings the macro
536.Va NS_RFRAGS(slot)
537indicates the remaining number of slots for this packet,
538including the current one.
539Slots with a value greater than 1 also have NS_MOREFRAG set.
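.Pp
Because no field holds the total packet length, a receiver has to add up
the fragment lengths itself.
The sketch below does so on a reduced stand-in for
.Vt struct netmap_slot ;
.Dv DEMO_NS_MOREFRAG
is a placeholder bit (real code takes
.Dv NS_MOREFRAG
from the
.Nm
header) and
.Fn demo_packet_len
is a hypothetical helper.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit; real code uses NS_MOREFRAG from the header. */
#define DEMO_NS_MOREFRAG 0x0020u
#define DEMO_MAX_FRAGS   64u    /* maximum chain length */

struct demo_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/*
 * Total length of the packet that starts at slot[first], following
 * the chain while DEMO_NS_MOREFRAG is set.  avail is the number of
 * slots the caller owns starting at first.  On success *nslots is
 * the number of slots used; if the chain is incomplete or too long,
 * *nslots is 0 and 0 is returned.
 */
static uint32_t
demo_packet_len(const struct demo_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nslots)
{
    uint32_t i = first, used = 0, total = 0;

    assert(slot != NULL && nslots != NULL && first < num_slots);
    while (used < avail && used < DEMO_MAX_FRAGS) {
        total += slot[i].len;
        used++;
        if ((slot[i].flags & DEMO_NS_MOREFRAG) == 0) {
            *nslots = used;     /* last fragment reached */
            return (total);
        }
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nslots = 0;
    return (0);
}
```
.Ed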
540.Sh IOCTLS
541.Nm
542uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
543for non-blocking I/O.
544They take no argument.
545Two more ioctls (NIOCGINFO, NIOCREGIF) are used
546to query and configure ports, with the following argument:
547.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
565.Ed
566.Pp
567A file descriptor obtained through
568.Pa /dev/netmap
also supports the ioctls supported by network devices, see
570.Xr netintro 4 .
571.Bl -tag -width XXXX
572.It Dv NIOCGINFO
573returns EINVAL if the named port does not support netmap.
574Otherwise, it returns 0 and (advisory) information
575about the port.
576Note that all the information below can change before the
577interface is actually put in netmap mode.
578.Bl -tag -width XX
579.It Pa nr_memsize
580indicates the size of the
581.Nm
582memory region.
583NICs in
584.Nm
585mode all share the same memory region,
586whereas
587.Nm VALE
588ports have independent regions for each port.
589.It Pa nr_tx_slots , nr_rx_slots
590indicate the size of transmit and receive rings.
591.It Pa nr_tx_rings , nr_rx_rings
592indicate the number of transmit
593and receive rings.
594Both ring number and sizes may be configured at runtime
595using interface-specific functions (e.g.,
596.Xr ethtool 8
597).
598.El
599.It Dv NIOCREGIF
600binds the port named in
601.Va nr_name
602to the file descriptor.
603For a physical device this also switches it into
604.Nm
605mode, disconnecting
606it from the host stack.
607Multiple file descriptors can be bound to the same port,
608with proper synchronization left to the user.
609.Pp
610The recommended way to bind a file descriptor to a port is
611to use function
612.Va nm_open(..)
613(see
614.Sx LIBRARIES )
615which parses names to access specific port types and
616enable features.
617In the following we document the main features.
618.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations in which a master process acts
as a dispatcher towards slave processes.
625.Pp
626To enable this function, the
627.Pa nr_arg1
628field of the structure can be used as a hint to the kernel to
629indicate how many pipes we expect to use, and reserve extra space
630in the memory region.
631.Pp
632On return, it gives the same info as NIOCGINFO,
633with
634.Pa nr_ringid
635and
636.Pa nr_flags
637indicating the identity of the rings controlled through the file
638descriptor.
639.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
643Possible values of
644.Pa nr_flags
645are indicated below, together with the naming schemes
646that application libraries (such as the
647.Nm nm_open
648indicated below) can use to indicate the specific set of rings.
649In the example below, "netmap:foo" is any valid netmap port name.
650.Bl -tag -width XXXXX
651.It NR_REG_ALL_NIC                         "netmap:foo"
652(default) all hardware ring pairs
653.It NR_REG_SW            "netmap:foo^"
654the ``host rings'', connecting to the host stack.
655.It NR_REG_NIC_SW        "netmap:foo+"
656all hardware rings and the host rings
657.It NR_REG_ONE_NIC       "netmap:foo-i"
658only the i-th hardware ring pair, where the number is in
659.Pa nr_ringid ;
660.It NR_REG_PIPE_MASTER  "netmap:foo{i"
661the master side of the netmap pipe whose identifier (i) is in
662.Pa nr_ringid ;
663.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
664the slave side of the netmap pipe whose identifier (i) is in
665.Pa nr_ringid .
666.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
673.El
674.Pp
675By default, a
676.Xr poll 2
677or
678.Xr select 2
679call pushes out any pending packets on the transmit ring, even if
680no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or with a full ring.
689.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
694.It Dv NIOCTXSYNC
695tells the hardware of new packets to transmit, and updates the
696number of slots available for transmission.
697.It Dv NIOCRXSYNC
698tells the hardware of consumed packets, and asks for newly available
699packets.
700.El
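.Pp
The port naming scheme listed under
.Dv NIOCREGIF
maps a name suffix to a registration mode.
The following simplified sketch shows one way to classify such names.
It is not the parser used by
.Fn nm_open :
the enum values are arbitrary, the function name is hypothetical,
ring and pipe identifiers are assumed to be decimal numbers, and
port names are assumed not to contain any of the suffix characters.
.Bd -literal
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative mirror of the NR_REG_* modes; values are arbitrary. */
enum demo_reg {
    DEMO_REG_INVALID, DEMO_REG_ALL_NIC, DEMO_REG_SW, DEMO_REG_NIC_SW,
    DEMO_REG_ONE_NIC, DEMO_REG_PIPE_MASTER, DEMO_REG_PIPE_SLAVE
};

/*
 * Classify a "netmap:foo<suffix>" name and store the ring or pipe
 * identifier, if any, in *id.
 */
static enum demo_reg
demo_parse_port(const char *name, unsigned long *id)
{
    const char *p, *sfx;
    char *end;
    enum demo_reg mode;

    assert(name != NULL && id != NULL);
    *id = 0;
    if (strncmp(name, "netmap:", 7) != 0)
        return (DEMO_REG_INVALID);
    p = name + 7;
    sfx = p + strcspn(p, "^+-{}");      /* first suffix character */
    if (sfx == p)
        return (DEMO_REG_INVALID);      /* empty port name */
    switch (*sfx) {
    case 0:
        return (DEMO_REG_ALL_NIC);
    case '^':
        return (sfx[1] == 0 ? DEMO_REG_SW : DEMO_REG_INVALID);
    case '+':
        return (sfx[1] == 0 ? DEMO_REG_NIC_SW : DEMO_REG_INVALID);
    case '-':
        mode = DEMO_REG_ONE_NIC;
        break;
    case '{':
        mode = DEMO_REG_PIPE_MASTER;
        break;
    default:                            /* closing brace */
        mode = DEMO_REG_PIPE_SLAVE;
        break;
    }
    if (sfx[1] < '0' || sfx[1] > '9')
        return (DEMO_REG_INVALID);      /* identifier must be numeric */
    *id = strtoul(sfx + 1, &end, 10);
    return (*end == 0 ? mode : DEMO_REG_INVALID);
}
```
.Ed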
.Sh SELECT, POLL, EPOLL, KQUEUE
702.Xr select 2
703and
704.Xr poll 2
705on a
706.Nm
707file descriptor process rings as indicated in
708.Sx TRANSMIT RINGS
709and
710.Sx RECEIVE RINGS ,
711respectively when write (POLLOUT) and read (POLLIN) events are requested.
712Both block if no slots are available in the ring
713.Va ( ring->cur == ring->tail ) .
714Depending on the platform,
715.Xr epoll 2
716and
717.Xr kqueue 2
718are supported too.
719.Pp
720Packets in transmit rings are normally pushed out
721(and buffers reclaimed) even without
722requesting write events.
723Passing the
724.Dv NETMAP_NO_TX_POLL
725flag to
726.Em NIOCREGIF
727disables this feature.
728By default, receive rings are processed only if read
729events are requested.
730Passing the
731.Dv NETMAP_DO_RX_POLL
732flag to
.Em NIOCREGIF
updates receive rings even without read events.
734Note that on epoll and kqueue,
735.Dv NETMAP_NO_TX_POLL
736and
737.Dv NETMAP_DO_RX_POLL
738only have an effect when some event is posted for the file descriptor.
739.Sh LIBRARIES
740The
741.Nm
742API is supposed to be used directly, both because of its simplicity and
743for efficient integration with applications.
744.Pp
745For convenience, the
746.In net/netmap_user.h
747header provides a few macros and functions to ease creating
748a file descriptor and doing I/O with a
749.Nm
750port.
751These are loosely modeled after the
752.Xr pcap 3
753API, to ease porting of libpcap-based applications to
754.Nm .
755To use these extra functions, programs should
756.Dl #define NETMAP_WITH_LIBS
757before
758.Dl #include <net/netmap_user.h>
759.Pp
760The following functions are available:
761.Bl -tag -width XXXXX
762.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
763similar to
764.Xr pcap_open 3pcap ,
765binds a file descriptor to a port.
766.Bl -tag -width XX
767.It Va ifname
768is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
769.Nm VALE
770port.
771.It Va req
772provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
774ifname and flags, and other fields can be overridden through
775the other two arguments.
776.It Va arg
777points to a struct nm_desc containing arguments (e.g., from a previously
778open file descriptor) that should override the defaults.
The fields are used as described below.
780.It Va flags
781can be set to a combination of the following flags:
782.Va NETMAP_NO_TX_POLL ,
783.Va NETMAP_DO_RX_POLL
784(copied into nr_ringid);
785.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
786avoids the mmap and uses the values from it);
787.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
788.Va NM_OPEN_ARG1 ,
789.Va NM_OPEN_ARG2 ,
790.Va NM_OPEN_ARG3 (uses the fields from arg);
791.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
792.El
793.It Va int nm_close(struct nm_desc *d)
794closes the file descriptor, unmaps memory, frees resources.
795.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets;
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
802.El
803.Sh SUPPORTED DEVICES
804.Nm
805natively supports the following devices:
806.Pp
807On FreeBSD:
808.Xr cxgbe 4 ,
809.Xr em 4 ,
810.Xr igb 4 ,
811.Xr ixgbe 4 ,
812.Xr ixl 4 ,
813.Xr lem 4 ,
814.Xr re 4 .
815.Pp
On Linux:
817.Xr e1000 4 ,
818.Xr e1000e 4 ,
819.Xr i40e 4 ,
820.Xr igb 4 ,
821.Xr ixgbe 4 ,
822.Xr r8169 4 .
823.Pp
824NICs without native support can still be used in
825.Nm
826mode through emulation.
827Performance is inferior to native netmap
828mode but still significantly higher than various raw socket types
829(bpf, PF_PACKET, etc.).
830Note that for slow devices (such as 1 Gbit/s and slower NICs,
831or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
832emulated and native mode will likely have similar or same throughput.
833.Pp
834When emulation is in use, packet sniffer programs such as tcpdump
835could see received packets before they are diverted by netmap.
836This behaviour is not intentional, being just an artifact of the implementation
837of emulation.
838Note that in case the netmap application subsequently moves packets received
839from the emulated adapter onto the host RX ring, the sniffer will intercept
840those packets again, since the packets are injected to the host stack as they
841were received by the network interface.
842.Pp
843Emulation is also available for devices with native netmap support,
844which can be used for testing or performance comparison.
845The sysctl variable
846.Va dev.netmap.admode
847globally controls how netmap mode is implemented.
848.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
850.Nm
851are controlled through sysctl variables on FreeBSD
852.Em ( dev.netmap.* )
853and module parameters on Linux
854.Em ( /sys/module/netmap_lin/parameters/* ) :
855.Bl -tag -width indent
856.It Va dev.netmap.admode: 0
857Controls the use of native or emulated adapter mode.
858.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
864.It Va dev.netmap.generic_ringsize: 1024
865Ring size used for emulated netmap mode
866.It Va dev.netmap.generic_mit: 100000
867Controls interrupt moderation for emulated mode
868.It Va dev.netmap.mmap_unreg: 0
869.It Va dev.netmap.fwd: 0
870Forces NS_FORWARD mode
871.It Va dev.netmap.flags: 0
872.It Va dev.netmap.txsync_retry: 2
873.It Va dev.netmap.no_pendintr: 1
874Forces recovery of transmit buffers on system calls
875.It Va dev.netmap.mitigate: 1
876Propagates interrupt mitigation to user processes
877.It Va dev.netmap.no_timestamp: 0
878Disables the update of the timestamp in the netmap ring
879.It Va dev.netmap.verbose: 0
880Verbose kernel messages
881.It Va dev.netmap.buf_num: 163840
882.It Va dev.netmap.buf_size: 2048
883.It Va dev.netmap.ring_num: 200
884.It Va dev.netmap.ring_size: 36864
885.It Va dev.netmap.if_num: 100
886.It Va dev.netmap.if_size: 1024
887Sizes and number of objects (netmap_if, netmap_ring, buffers)
888for the global memory region.
889The only parameter worth modifying is
890.Va dev.netmap.buf_num
891as it impacts the total amount of memory used by netmap.
892.It Va dev.netmap.buf_curr_num: 0
893.It Va dev.netmap.buf_curr_size: 0
894.It Va dev.netmap.ring_curr_num: 0
895.It Va dev.netmap.ring_curr_size: 0
896.It Va dev.netmap.if_curr_num: 0
897.It Va dev.netmap.if_curr_size: 0
898Actual values in use.
899.It Va dev.netmap.bridge_batch: 1024
900Batch size used when moving packets across a
901.Nm VALE
902switch.
903Values above 64 generally guarantee good
904performance.
905.El
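.Pp
The memory used by each object pool in the global region is the number
of objects times the object size, which is why
.Va dev.netmap.buf_num
dominates the total.
A minimal sketch of that arithmetic, using the default values listed above
and a hypothetical helper name:
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes taken by a pool of num objects of size bytes each, e.g.
 * dev.netmap.buf_num times dev.netmap.buf_size for packet buffers.
 * Hypothetical helper; it ignores any allocator overhead.
 */
static uint64_t
demo_pool_bytes(uint32_t num, uint32_t size)
{
    assert(size > 0);
    return ((uint64_t)num * size);
}
```
.Ed
.Pp
With the defaults, buffers take 163840 * 2048 = 335544320 bytes (320 MiB),
rings 200 * 36864 = 7372800 bytes, and interfaces
100 * 1024 = 102400 bytes.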
906.Sh SYSTEM CALLS
907.Nm
908uses
909.Xr select 2 ,
910.Xr poll 2 ,
911.Xr epoll 2
912and
913.Xr kqueue 2
914to wake up processes when significant events occur, and
915.Xr mmap 2
916to map memory.
917.Xr ioctl 2
918is used to configure ports and
919.Nm VALE switches .
920.Pp
921Applications may need to create threads and bind them to
922specific cores to improve performance, using standard
923OS primitives, see
924.Xr pthread 3 .
925In particular,
926.Xr pthread_setaffinity_np 3
927may be of use.
928.Sh EXAMPLES
929.Ss TEST PROGRAMS
930.Nm
931comes with a few programs that can be used for testing or
932simple applications.
933See the
934.Pa examples/
935directory in
936.Nm
937distributions, or
938.Pa tools/tools/netmap/
939directory in
940.Fx
941distributions.
942.Pp
943.Xr pkt-gen 8
944is a general purpose traffic source/sink.
945.Pp
946As an example
947.Dl pkt-gen -i ix0 -f tx -l 60
948can generate an infinite stream of minimum size packets, and
949.Dl pkt-gen -i ix0 -f rx
950is a traffic sink.
951Both print traffic statistics, to help monitor
952how the system performs.
953.Pp
954.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
956rates, and use multiple send/receive threads and cores.
957.Pp
958.Xr bridge 4
959is another test program which interconnects two
960.Nm
961ports.
962It can be used for transparent forwarding between
963interfaces, as in
964.Dl bridge -i ix0 -i ix1
965or even connect the NIC to the host stack using netmap
966.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <net/netmap_user.h>
\&...
/* Error checking is omitted for brevity. */
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
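.Pp
The fragment above handles a single slot.
The following self-contained sketch applies the same swap in a loop
for as long as both rings have slots available.
It deliberately uses toy stand-ins
.Pq Vt struct toy_slot , Vt struct toy_ring , Fn toy_ring_next , Fn toy_ring_space , Dv TOY_BUF_CHANGED
rather than the real
.Nm
definitions, so that it compiles without the netmap headers;
all of these names, and the helper
.Fn toy_forward ,
are assumptions made for illustration only.
.Bd -literal -compact
```c
#include <stdint.h>

#define TOY_BUF_CHANGED	0x0001	/* stand-in for NS_BUF_CHANGED */
#define TOY_MAX_SLOTS	8

struct toy_slot {		/* stand-in for struct netmap_slot */
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

struct toy_ring {		/* stand-in for struct netmap_ring */
	uint32_t num_slots;
	uint32_t head, cur, tail;
	struct toy_slot slot[TOY_MAX_SLOTS];
};

/* Index following i, wrapping at the end of the ring. */
uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
	return ((i + 1 == r->num_slots) ? 0 : i + 1);
}

/* Number of slots available between cur and tail. */
uint32_t
toy_ring_space(const struct toy_ring *r)
{
	int space = (int)r->tail - (int)r->cur;

	if (space < 0)
		space += (int)r->num_slots;
	return ((uint32_t)space);
}

/* Move up to limit packets from rxr to txr by swapping buffer indexes. */
uint32_t
toy_forward(struct toy_ring *rxr, struct toy_ring *txr, uint32_t limit)
{
	uint32_t n = 0;

	while (n < limit && toy_ring_space(rxr) > 0 &&
	    toy_ring_space(txr) > 0) {
		struct toy_slot *src = &rxr->slot[rxr->cur];
		struct toy_slot *dst = &txr->slot[txr->cur];
		uint32_t tmp = dst->buf_idx;

		/* The TX slot takes the received buffer and its length. */
		dst->buf_idx = src->buf_idx;
		dst->len = src->len;
		dst->flags = TOY_BUF_CHANGED;
		/* The RX slot is replenished with the old TX buffer. */
		src->buf_idx = tmp;
		src->flags = TOY_BUF_CHANGED;
		rxr->head = rxr->cur = toy_ring_next(rxr, rxr->cur);
		txr->head = txr->cur = toy_ring_next(txr, txr->cur);
		n++;
	}
	return (n);
}
```
.Ed
.Pp
Only buffer indexes and lengths are exchanged; no packet payload is copied.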
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
1146