.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd December 14, 2015
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Pp
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on
.Fx
and Linux, and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All of these are accessed interchangeably with the same API.
.Pp
.Nm ,
.Nm VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2 .
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pq Vt struct netmap_ring
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Dv NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
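.Pp
The list of extra buffers rooted at ni_bufs_head can be walked by
following the index stored in the first uint32_t of each buffer until
index 0 is reached.
The sketch below illustrates the traversal only.
The name count_extra_bufs and the contiguous buffer pool are assumptions
of this example; real code would locate each buffer with NETMAP_BUF().
.Bd -literal
```c
#include <stdint.h>
#include <string.h>

/*
 * Count the buffers on the extra-buffer list starting at index head.
 * bufs points to buffer 0 of a contiguous pool of buf_size-byte
 * buffers (an assumption of this sketch).  Index 0 ends the list.
 */
static uint32_t
count_extra_bufs(const char *bufs, uint32_t buf_size, uint32_t head)
{
    uint32_t n = 0;

    while (head != 0) {
        uint32_t next;

        /* The first uint32_t of a free buffer is the next index. */
        memcpy(&next, bufs + (size_t)head * buf_size, sizeof(next));
        head = next;
        n++;
    }
    return n;
}
```
.Ed
The same loop can hand each buffer index to a slot instead of counting.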
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
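.Pp
Relative references stay valid wherever a process maps the region,
because each one is resolved against that process's own base address.
The sketch below shows the arithmetic behind this idea.
region_ptr and region_buf are hypothetical names, and the pool layout
is an assumption; applications should use the NETMAP_* macros above.
.Bd -literal
```c
#include <stddef.h>
#include <stdint.h>

/* Turn a byte offset relative to base into a pointer. */
static void *
region_ptr(void *base, size_t offset)
{
    return (char *)base + offset;
}

/* Address of fixed-size buffer idx in a pool starting at pool_ofs. */
static char *
region_buf(void *base, size_t pool_ofs, uint32_t buf_size, uint32_t idx)
{
    return (char *)region_ptr(base, pool_ofs) + (size_t)idx * buf_size;
}
```
.Ed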
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
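.Pp
Because the ring size need not be a power of two, the wrap-around
cannot be done with a bit mask.
The sketch below shows an equivalent computation on plain integers.
ring_next is a local illustration, not the library function.
.Bd -literal
```c
#include <stdint.h>

/* Next slot index after i in a ring of num_slots slots. */
static uint32_t
ring_next(uint32_t i, uint32_t num_slots)
{
    /* Compare and wrap; valid for any ring size. */
    return (i + 1 == num_slots) ? 0 : i + 1;
}
```
.Ed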
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
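.Pp
The transmit-side index arithmetic described above can be sketched
on a local stand-in for the ring indexes.
The structure ring_idx and the functions tx_space, tx_pending and
tx_queue are illustrative names introduced here, not netmap APIs.
.Bd -literal
```c
#include <stdint.h>

/* Local stand-in for the ring indexes documented above. */
struct ring_idx {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

static uint32_t
idx_next(uint32_t i, uint32_t num_slots)
{
    return (i + 1 == num_slots) ? 0 : i + 1;
}

/* Slots in head ... tail-1, free for new packets on a TX ring. */
static uint32_t
tx_space(const struct ring_idx *r)
{
    return (r->tail + r->num_slots - r->head) % r->num_slots;
}

/* Nonzero when head != tail + 1 (modulo the ring size). */
static int
tx_pending(const struct ring_idx *r)
{
    return idx_next(r->tail, r->num_slots) != r->head;
}

/* Queue up to want packets by advancing head and cur. */
static uint32_t
tx_queue(struct ring_idx *r, uint32_t want)
{
    uint32_t n = tx_space(r);

    if (n > want)
        n = want;
    for (uint32_t k = 0; k < n; k++) {
        /* Real code fills slot[r->head] and its buffer here. */
        r->head = idx_next(r->head, r->num_slots);
    }
    r->cur = r->head;
    return n;
}
```
.Ed
Queuing as many packets as tx_space allows before the next system call
follows the batching advice given above.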
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
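.Pp
A receive loop that consumes every slot in the range head ... tail-1
and then returns them all to the kernel can be sketched as follows.
The structures rx_slot and rx_ring are simplified local stand-ins,
and rx_drain is a hypothetical name.
.Bd -literal
```c
#include <stdint.h>

struct rx_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

struct rx_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    struct rx_slot *slot;
};

/*
 * Consume every received slot (head ... tail-1), summing the
 * packet lengths, then release the slots by advancing head and cur.
 */
static uint64_t
rx_drain(struct rx_ring *r)
{
    uint64_t bytes = 0;
    uint32_t i = r->head;

    while (i != r->tail) {
        bytes += r->slot[i].len;
        i = (i + 1 == r->num_slots) ? 0 : i + 1;
    }
    r->head = r->cur = i;
    return bytes;
}
```
.Ed
The slots are actually handed back at the next receive system call on
the file descriptor.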
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
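.Pp
Changing buf_idx and setting NS_BUF_CHANGED is what allows a packet to
move between rings without copying its payload.
The sketch below swaps the buffers of a receive slot and a transmit slot.
The structure slot_sketch mirrors the documented slot layout,
slot_swap_bufs is a hypothetical name, and the fallback value of
NS_BUF_CHANGED is an assumption; take the real one from the netmap headers.
.Bd -literal
```c
#include <stdint.h>

#ifndef NS_BUF_CHANGED
#define NS_BUF_CHANGED 0x0001   /* assumed value for this sketch */
#endif

/* Local copy of the slot layout documented above. */
struct slot_sketch {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
    uint64_t ptr;
};

/* Hand the received buffer to tx and give tx's old buffer to rx. */
static void
slot_swap_bufs(struct slot_sketch *rx, struct slot_sketch *tx)
{
    uint32_t idx = tx->buf_idx;

    tx->buf_idx = rx->buf_idx;
    tx->len = rx->len;
    rx->buf_idx = idx;
    /* Both slots now reference a different buffer. */
    tx->flags |= NS_BUF_CHANGED;
    rx->flags |= NS_BUF_CHANGED;
}
```
.Ed
The swap is only meaningful when both slots refer to buffers in the
same memory region.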
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
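.Pp
Since no field holds the total length, a receiver has to add up the
fragment lengths itself.
The sketch below does so on a simplified slot array.
The structure frag_slot and the function packet_length are hypothetical,
and the fallback value of NS_MOREFRAG is an assumption; take the real one
from the netmap headers.
.Bd -literal
```c
#include <stdint.h>

#ifndef NS_MOREFRAG
#define NS_MOREFRAG 0x0020      /* assumed value for this sketch */
#endif

struct frag_slot {
    uint16_t len;
    uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot first.
 * Stores the number of slots used in *nfrags and returns the total,
 * or -1 if the chain does not end within avail slots.
 */
static long
packet_length(const struct frag_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nfrags)
{
    long total = 0;
    uint32_t i = first;

    for (uint32_t n = 1; n <= avail; n++) {
        total += slot[i].len;
        if ((slot[i].flags & NS_MOREFRAG) == 0) {
            *nfrags = n;
            return total;
        }
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    return -1;
}
```
.Ed
A return value of -1 means the rest of the packet has not been made
available yet, so the caller should wait before consuming any of it.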
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
57768b8534bSLuigi Rizzo.It Dv NIOCREGIF
57817885a7bSLuigi Rizzobinds the port named in
57917885a7bSLuigi Rizzo.Va nr_name
580*e91d04f7SChristian Bruefferto the file descriptor.
581*e91d04f7SChristian BruefferFor a physical device this also switches it into
58217885a7bSLuigi Rizzo.Nm
58317885a7bSLuigi Rizzomode, disconnecting
58417885a7bSLuigi Rizzoit from the host stack.
58517885a7bSLuigi RizzoMultiple file descriptors can be bound to the same port,
58617885a7bSLuigi Rizzowith proper synchronization left to the user.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as its parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel,
indicating how many pipes the application expects to use so that
extra space can be reserved in the memory region.
.Pp
On return,
.Dv NIOCREGIF
gives the same information as
.Dv NIOCGINFO ,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Fn nm_open
function described in
.Sx LIBRARIES )
can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs.
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings.
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid .
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe should be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When
.Va NETMAP_NO_TX_POLL
is set,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() are called with a write event
(POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, the desired number of rings (1 by default,
and currently up to 16) can be specified using the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
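.Pp
The port naming scheme listed under
.Dv NIOCREGIF
can be illustrated with a small, self-contained parser.
This is only a sketch: the
.Va ex_
names and the enum are hypothetical, the numeric values are not those of the
.Dv NR_REG_*
constants, and the code does not reproduce the parsing actually done by
.Fn nm_open .
It also assumes that the interface name itself contains none of the
suffix characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the NR_REG_* modes; values are illustrative. */
enum ex_mode {
    EX_ALL_NIC, EX_SW, EX_NIC_SW, EX_ONE_NIC, EX_PIPE_MASTER, EX_PIPE_SLAVE,
    EX_BAD
};

/*
 * Classify a port name such as "netmap:foo-2" by its suffix.
 * The ring or pipe identifier (0 if there is none) is stored in *id.
 */
static enum ex_mode ex_parse_port(const char *name, unsigned *id)
{
    const char *p;
    char *end;
    enum ex_mode m;

    assert(name != 0 && id != 0);
    *id = 0;
    if (strncmp(name, "netmap:", 7) != 0)
        return EX_BAD;
    p = name + 7 + strcspn(name + 7, "^+-{}");  /* first suffix character */
    if (p == name + 7)
        return EX_BAD;                          /* empty interface name */
    switch (*p) {
    case 0:   return EX_ALL_NIC;                /* no suffix at all */
    case '^': return p[1] == 0 ? EX_SW : EX_BAD;
    case '+': return p[1] == 0 ? EX_NIC_SW : EX_BAD;
    case '-': m = EX_ONE_NIC; break;
    case '{': m = EX_PIPE_MASTER; break;
    default:  m = EX_PIPE_SLAVE; break;         /* '}' */
    }
    if (p[1] < '0' || p[1] > '9')
        return EX_BAD;                          /* a number must follow */
    *id = (unsigned)strtoul(p + 1, &end, 10);
    return *end == 0 ? m : EX_BAD;
}
```

For instance, calling ex_parse_port() on "netmap:foo{3" yields EX_PIPE_MASTER
with the identifier 3, matching the
.Dv NR_REG_PIPE_MASTER
entry in the list.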
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively, when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 7
and
.Xr kqueue 2
are supported too.
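.Pp
The blocking condition above can be modeled with plain index arithmetic.
The following is a minimal sketch using a stand-in structure; the
.Va ex_
names are hypothetical and the structure is not the real
.Vt struct netmap_ring ,
it only assumes a circular array whose
.Va cur
and
.Va tail
indices behave as described in this page.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the ring indices; not the real netmap layout. */
struct ex_ring {
    uint32_t num_slots;  /* total slots in the circular array */
    uint32_t cur;        /* next slot the application will use */
    uint32_t tail;       /* first slot not available to the application */
};

/* No slots available: the condition under which select/poll block. */
static int ex_ring_empty(const struct ex_ring *r)
{
    return r->cur == r->tail;
}

/* Index following i, wrapping around at the end of the array. */
static uint32_t ex_ring_next(const struct ex_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots between cur and tail, accounting for wraparound. */
static uint32_t ex_ring_space(const struct ex_ring *r)
{
    return (r->tail + r->num_slots - r->cur) % r->num_slots;
}
```

With 8 slots,
.Va cur
at 6 and
.Va tail
at 2, four slots (6, 7, 0 and 1) are available; once
.Va cur
reaches 2 the ring is empty and a further poll would block.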
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Dv NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Dv NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is designed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Fn pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Nm pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native mode and
fails if it is not available, 2 forces emulated mode and hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
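.Pp
To see why
.Va dev.netmap.buf_num
dominates memory use, the pool sizes can be multiplied out.
The sketch below assumes that the global region is roughly the sum of
the three object pools and ignores any alignment or rounding the
allocator may apply; the
.Va ex_
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Object counts and sizes, named after the dev.netmap.* variables. */
struct ex_mem_cfg {
    uint64_t buf_num, buf_size;
    uint64_t ring_num, ring_size;
    uint64_t if_num, if_size;
};

/* Sum of num * size over the three object pools, in bytes. */
static uint64_t ex_mem_total(const struct ex_mem_cfg *c)
{
    assert(c != 0);
    return c->buf_num * c->buf_size +
           c->ring_num * c->ring_size +
           c->if_num * c->if_size;
}
```

With the default values listed above, buffers account for
163840 * 2048 = 335544320 bytes (320 MiB) out of an estimated
343019520 bytes, so rings and interface descriptors together
contribute only about 7 MiB.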
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 7
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Nm pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Nm pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Nm bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even to connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);	/* bind ix0, enter netmap mode */
    p = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE,
	MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
	poll(&fds, 1, -1);	/* wait for free transmit slots */
	while (!nm_ring_empty(ring)) {
	    i = ring->cur;
	    buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
	    ... prepare packet in buf ...
	    ring->slot[i].len = ... packet length ...
	    ring->head = ring->cur = nm_ring_next(ring, i);
	}
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
	poll(&fds, 1, -1);
	while ((buf = nm_nextpkt(d, &h)) != NULL)
	    consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
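.Pp
The buffer exchange can also be exercised without a netmap port,
which may help when unit-testing forwarding logic.
The sketch below uses a reduced stand-in for the slot structure; the
.Va ex_
names and the flag value are hypothetical and only mirror the fields
used in the fragment above.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BUF_CHANGED 0x1u   /* hypothetical stand-in for NS_BUF_CHANGED */

/* Reduced stand-in for a slot: buffer index, length, flags. */
struct ex_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/*
 * Move the packet held by *src (receive side) to *dst (transmit side)
 * by exchanging buffer indices; no payload bytes are copied.
 */
static void ex_swap_slots(struct ex_slot *src, struct ex_slot *dst)
{
    uint32_t tmp;

    assert(src != dst);
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;   /* transmit slot now owns the packet */
    dst->len = src->len;
    dst->flags = EX_BUF_CHANGED;
    src->buf_idx = tmp;            /* receive slot gets a free buffer back */
    src->flags = EX_BUF_CHANGED;
}
```

After the call both slots carry the changed-buffer flag, the transmit
slot refers to the received packet, and the receive slot refers to the
buffer that previously belonged to the transmit slot.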
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g. with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
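.Pp
To get a feeling for what line rate means with small packets,
the packet rate can be derived from the link speed.
The sketch below assumes standard Ethernet framing, where every frame
occupies an extra 20 bytes on the wire (8 bytes of preamble and start
delimiter plus a 12-byte inter-frame gap); the
.Va ex_
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 8 bytes preamble/SFD + 12 bytes inter-frame gap. */
#define EX_ETH_OVERHEAD 20u

/*
 * Maximum frames per second on an Ethernet link running at 'bps' bits
 * per second, for frames of 'frame_len' bytes including the 4-byte CRC.
 */
static uint64_t ex_line_rate_pps(uint64_t bps, uint32_t frame_len)
{
    assert(frame_len >= 64);   /* minimum Ethernet frame size */
    return bps / (8ull * (frame_len + EX_ETH_OVERHEAD));
}
```

For a 10 Gbit/s link this gives 14880952 frames per second with
64-byte frames, i.e. roughly 67 ns per packet, against 812743 frames
per second with 1518-byte frames.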
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.