.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd November 8, 2019
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr kqueue 2
and
.Xr epoll 7 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+, the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface )
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
The number of extra
buffers is specified in the
.Va arg.nr_arg3
field.
On success, the kernel writes back to
.Va arg.nr_arg3
the number of extra buffers actually allocated (this may be fewer
than the number requested if the memory space ran out of buffers).
.Pa ni_bufs_head
contains the index of the first of these extra buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
The application is free to modify
this list and use the buffers (i.e., binding them to the slots of a
netmap ring).
When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
.Em NIOCREGIF .
.It Dv struct netmap_ring (one per ring )
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer )
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
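.Pp
The extra-buffer list described under
.Vt struct netmap_if
can be walked as in the minimal sketch below.
The names
.Fn count_extra_bufs ,
.Va pool
and
.Va buf_size
are hypothetical and not part of the
.Nm
API: the sketch models the buffer area as a flat array,
whereas real code would locate each buffer with
.Fn NETMAP_BUF .

```c
#include <stdint.h>
#include <string.h>

/*
 * Count the buffers in an extra-buffer list starting at index 'head'.
 * 'pool' and 'buf_size' are a hypothetical stand-in for the mmapped
 * buffer area (assumption made for illustration only).
 */
static uint32_t
count_extra_bufs(const char *pool, uint32_t buf_size, uint32_t head)
{
	uint32_t n = 0;
	uint32_t idx = head;

	while (idx != 0) {		/* index 0 terminates the list */
		uint32_t next;

		/* the first uint32_t of each buffer is the next index */
		memcpy(&next, pool + (size_t)idx * buf_size, sizeof(next));
		idx = next;
		n++;
	}
	return (n);
}
```

The same loop shape can be used to pop buffers from the list
before assigning them to ring slots.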
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved for the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
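.Pp
As a minimal sketch of that index arithmetic, the hypothetical helper
.Fn ring_next
below (a stand-in written for illustration, not the library function)
advances an index with an explicit wrap test,
which works for any ring size and not only for powers of two.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for nm_ring_next(): return the index that
 * follows 'i' in a ring of 'num_slots' slots.
 */
static uint32_t
ring_next(uint32_t num_slots, uint32_t i)
{
	/* compare instead of masking: num_slots may not be a power of two */
	return ((i + 1 == num_slots) ? 0 : i + 1);
}
```

For a ring of 6 slots, index 4 is followed by 5, and 5 wraps to 0.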
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
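.Pp
The two conditions above can be sketched on a reduced model of the ring.
.Vt struct ring_idx ,
.Fn tx_space
and
.Fn tx_pending
are hypothetical names used only for this illustration;
real code would use
.Vt struct netmap_ring
and the helpers from
.In net/netmap_user.h .

```c
#include <stdint.h>

/* Reduced model of a ring: only the fields used by the index tests. */
struct ring_idx {
	uint32_t num_slots;
	uint32_t head;
	uint32_t cur;
	uint32_t tail;
};

/* Slots the user may still fill: cur ... tail-1, with wrap-around. */
static uint32_t
tx_space(const struct ring_idx *r)
{
	return ((r->tail + r->num_slots - r->cur) % r->num_slots);
}

/* Nonzero when head != tail + 1 (modulo the ring size). */
static int
tx_pending(const struct ring_idx *r)
{
	return (r->head != (r->tail + 1) % r->num_slots);
}
```

With 8 slots, head = cur = 2 and tail = 1, seven slots are free and
nothing is pending; once the user advances head and cur to 5,
four slots remain free and the ring has pending transmissions.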
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
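.Pp
A receive loop following the rules above might look like the sketch below.
It operates on a reduced, hypothetical model of the ring
.Vt ( struct rx_idx )
and takes a caller-supplied callback; neither is part of the
.Nm
API, and the system call that actually returns the slots to the kernel
is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced model of a receive ring (illustration only). */
struct rx_idx {
	uint32_t num_slots;
	uint32_t head;
	uint32_t cur;
	uint32_t tail;
};

/*
 * Process every received slot in head ... tail-1, then advance head
 * and cur so the slots go back to the kernel at the next sync.
 * Returns the number of slots processed.
 */
static uint32_t
rx_drain(struct rx_idx *r, void (*handle)(uint32_t slot_index))
{
	uint32_t n = 0;
	uint32_t i = r->head;

	while (i != r->tail) {
		if (handle != NULL)
			handle(i);
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
		n++;
	}
	r->head = r->cur = i;	/* release everything that was processed */
	return (n);
}
```

An application that wants to hold some slots would instead stop
advancing head early and move only cur forward.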
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode,
packets marked with this flag by the user application are forwarded to the
other endpoint at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
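.Pp
Since no field holds the total packet length, an application has to add up
the fragments itself.
The sketch below does so on a reduced slot model;
.Vt struct slot_hdr ,
.Fn packet_len
and the
.Dv MOREFRAG
constant are hypothetical names, and the flag value is assumed to match
.Dv NS_MOREFRAG
in the header file.

```c
#include <stdint.h>

#define MOREFRAG 0x0020	/* assumed value of NS_MOREFRAG; check netmap.h */

/* Reduced model of struct netmap_slot (illustration only). */
struct slot_hdr {
	uint16_t len;
	uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot[first] and
 * store the index just past the packet in *next.
 * The caller must ensure the whole packet lies in head ... tail-1.
 */
static uint32_t
packet_len(const struct slot_hdr *slot, uint32_t num_slots,
    uint32_t first, uint32_t *next)
{
	uint32_t i = first;
	uint32_t total = 0;
	int more;

	do {
		total += slot[i].len;
		more = slot[i].flags & MOREFRAG;	/* clear on last fragment */
		i = (i + 1 == num_slots) ? 0 : i + 1;
	} while (more);
	*next = i;
	return (total);
}
```

The returned index is where processing of the following packet starts.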
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
60268b8534bSLuigi Rizzo.It Dv NIOCREGIF
60317885a7bSLuigi Rizzobinds the port named in
60417885a7bSLuigi Rizzo.Va nr_name
605e91d04f7SChristian Bruefferto the file descriptor.
606e91d04f7SChristian BruefferFor a physical device this also switches it into
60717885a7bSLuigi Rizzo.Nm
60817885a7bSLuigi Rizzomode, disconnecting
60917885a7bSLuigi Rizzoit from the host stack.
61017885a7bSLuigi RizzoMultiple file descriptors can be bound to the same port,
61117885a7bSLuigi Rizzowith proper synchronization left to the user.
61217885a7bSLuigi Rizzo.Pp
61337e3a6d3SLuigi RizzoThe recommended way to bind a file descriptor to a port is
61437e3a6d3SLuigi Rizzoto use function
61537e3a6d3SLuigi Rizzo.Va nm_open(..)
61637e3a6d3SLuigi Rizzo(see
6173f879a47SEnji Cooper.Sx LIBRARIES )
61837e3a6d3SLuigi Rizzowhich parses names to access specific port types and
61937e3a6d3SLuigi Rizzoenable features.
62037e3a6d3SLuigi RizzoIn the following we document the main features.
62137e3a6d3SLuigi Rizzo.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
628fa7db06bSLuigi Rizzo.Pp
629fa7db06bSLuigi RizzoTo enable this function, the
630fa7db06bSLuigi Rizzo.Pa nr_arg1
631fa7db06bSLuigi Rizzofield of the structure can be used as a hint to the kernel to
632fa7db06bSLuigi Rizzoindicate how many pipes we expect to use, and reserve extra space
633fa7db06bSLuigi Rizzoin the memory region.
634fa7db06bSLuigi Rizzo.Pp
635fa7db06bSLuigi RizzoOn return, it gives the same info as NIOCGINFO,
636fa7db06bSLuigi Rizzowith
637fa7db06bSLuigi Rizzo.Pa nr_ringid
638fa7db06bSLuigi Rizzoand
639fa7db06bSLuigi Rizzo.Pa nr_flags
640fa7db06bSLuigi Rizzoindicating the identity of the rings controlled through the file
64168b8534bSLuigi Rizzodescriptor.
64268b8534bSLuigi Rizzo.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Va nm_open()
function described in
.Sx LIBRARIES )
can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
65368b8534bSLuigi Rizzo.Bl -tag -width XXXXX
654fa7db06bSLuigi Rizzo.It NR_REG_ALL_NIC                         "netmap:foo"
655fa7db06bSLuigi Rizzo(default) all hardware ring pairs
656415dfa83SMaxim Sobolev.It NR_REG_SW            "netmap:foo^"
65717885a7bSLuigi Rizzothe ``host rings'', connecting to the host stack.
658d4d112e3SJoel Dahl.It NR_REG_NIC_SW        "netmap:foo+"
659fa7db06bSLuigi Rizzoall hardware rings and the host rings
660fa7db06bSLuigi Rizzo.It NR_REG_ONE_NIC       "netmap:foo-i"
661fa7db06bSLuigi Rizzoonly the i-th hardware ring pair, where the number is in
662fa7db06bSLuigi Rizzo.Pa nr_ringid ;
663fa7db06bSLuigi Rizzo.It NR_REG_PIPE_MASTER  "netmap:foo{i"
664fa7db06bSLuigi Rizzothe master side of the netmap pipe whose identifier (i) is in
665fa7db06bSLuigi Rizzo.Pa nr_ringid ;
666fa7db06bSLuigi Rizzo.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
667fa7db06bSLuigi Rizzothe slave side of the netmap pipe whose identifier (i) is in
668fa7db06bSLuigi Rizzo.Pa nr_ringid .
669fa7db06bSLuigi Rizzo.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
67668b8534bSLuigi Rizzo.El
67717885a7bSLuigi Rizzo.Pp
67868b8534bSLuigi RizzoBy default, a
67917885a7bSLuigi Rizzo.Xr poll 2
68068b8534bSLuigi Rizzoor
68117885a7bSLuigi Rizzo.Xr select 2
68268b8534bSLuigi Rizzocall pushes out any pending packets on the transmit ring, even if
68368b8534bSLuigi Rizzono write events are specified.
68468b8534bSLuigi RizzoThe feature can be disabled by or-ing
685415dfa83SMaxim Sobolev.Va NETMAP_NO_TX_POLL
68617885a7bSLuigi Rizzoto the value written to
68717885a7bSLuigi Rizzo.Va nr_ringid .
When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when
.Va select() /
.Va poll()
are called with a write event (POLLOUT/wfdset) or a full ring.
695ce3ee1e7SLuigi Rizzo.Pp
696ce3ee1e7SLuigi RizzoWhen registering a virtual interface that is dynamically created to a
697ce3ee1e7SLuigi Rizzo.Xr vale 4
698ce3ee1e7SLuigi Rizzoswitch, we can specify the desired number of rings (1 by default,
699ce3ee1e7SLuigi Rizzoand currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
70068b8534bSLuigi Rizzo.It Dv NIOCTXSYNC
70168b8534bSLuigi Rizzotells the hardware of new packets to transmit, and updates the
70268b8534bSLuigi Rizzonumber of slots available for transmission.
70368b8534bSLuigi Rizzo.It Dv NIOCRXSYNC
70468b8534bSLuigi Rizzotells the hardware of consumed packets, and asks for newly available
70568b8534bSLuigi Rizzopackets.
70668b8534bSLuigi Rizzo.El
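.Pp
The port naming scheme listed under NIOCREGIF can be illustrated with a small parser.
This is a minimal sketch and not the code used by nm_open(): the enum port_mode constants and the parse_port_suffix() function are hypothetical names introduced here, their numeric values are arbitrary and do not correspond to the NR_REG_* constants in the netmap headers, and interface names that themselves contain one of the suffix characters are not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the NR_REG_* modes; values are arbitrary. */
enum port_mode {
    MODE_ALL_NIC,      /* "netmap:foo"   */
    MODE_SW,           /* "netmap:foo^"  */
    MODE_NIC_SW,       /* "netmap:foo+"  */
    MODE_ONE_NIC,      /* "netmap:foo-i" */
    MODE_PIPE_MASTER,  /* "netmap:foo{i" */
    MODE_PIPE_SLAVE,   /* "netmap:foo}i" */
    MODE_INVALID
};

/*
 * Classify a port name by the first suffix character found after the
 * "netmap:" prefix, and store the numeric identifier (if any) in *id.
 */
static enum port_mode
parse_port_suffix(const char *name, unsigned *id)
{
    const char *p;
    char *end;
    size_t iflen;

    *id = 0;
    if (strncmp(name, "netmap:", 7) != 0)
        return MODE_INVALID;
    p = name + 7;
    iflen = strcspn(p, "^+-{}");
    if (iflen == 0)
        return MODE_INVALID;    /* empty interface name */
    p += iflen;

    switch (*p) {
    case '\0':
        return MODE_ALL_NIC;
    case '^':
        return p[1] == '\0' ? MODE_SW : MODE_INVALID;
    case '+':
        return p[1] == '\0' ? MODE_NIC_SW : MODE_INVALID;
    default:
        break;
    }

    /* '-', '{' and '}' must be followed by a decimal number. */
    if (p[1] < '0' || p[1] > '9')
        return MODE_INVALID;
    *id = (unsigned)strtoul(p + 1, &end, 10);
    if (*end != '\0')
        return MODE_INVALID;
    if (*p == '-')
        return MODE_ONE_NIC;
    return *p == '{' ? MODE_PIPE_MASTER : MODE_PIPE_SLAVE;
}
```

In this sketch the identifier returned in *id plays the role that nr_ringid plays in the ioctl: a ring index for the "-i" form and a pipe identifier for the "{i" and "}i" forms.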
707668e070fSVincenzo Maffione.Sh SELECT, POLL, EPOLL, KQUEUE
70817885a7bSLuigi Rizzo.Xr select 2
70917885a7bSLuigi Rizzoand
71017885a7bSLuigi Rizzo.Xr poll 2
71117885a7bSLuigi Rizzoon a
71217885a7bSLuigi Rizzo.Nm
71317885a7bSLuigi Rizzofile descriptor process rings as indicated in
71417885a7bSLuigi Rizzo.Sx TRANSMIT RINGS
71517885a7bSLuigi Rizzoand
716fa7db06bSLuigi Rizzo.Sx RECEIVE RINGS ,
717fa7db06bSLuigi Rizzorespectively when write (POLLOUT) and read (POLLIN) events are requested.
718fa7db06bSLuigi RizzoBoth block if no slots are available in the ring
719fa7db06bSLuigi Rizzo.Va ( ring->cur == ring->tail ) .
720fa7db06bSLuigi RizzoDepending on the platform,
721668e070fSVincenzo Maffione.Xr epoll 7
722fa7db06bSLuigi Rizzoand
723fa7db06bSLuigi Rizzo.Xr kqueue 2
724fa7db06bSLuigi Rizzoare supported too.
72517885a7bSLuigi Rizzo.Pp
726fa7db06bSLuigi RizzoPackets in transmit rings are normally pushed out
727fa7db06bSLuigi Rizzo(and buffers reclaimed) even without
728e91d04f7SChristian Bruefferrequesting write events.
729e91d04f7SChristian BruefferPassing the
730e91d04f7SChristian Brueffer.Dv NETMAP_NO_TX_POLL
731e91d04f7SChristian Bruefferflag to
73217885a7bSLuigi Rizzo.Em NIOCREGIF
73317885a7bSLuigi Rizzodisables this feature.
734fa7db06bSLuigi RizzoBy default, receive rings are processed only if read
735e91d04f7SChristian Bruefferevents are requested.
736e91d04f7SChristian BruefferPassing the
737e91d04f7SChristian Brueffer.Dv NETMAP_DO_RX_POLL
738e91d04f7SChristian Bruefferflag to
.Em NIOCREGIF
updates receive rings even without read events.
740668e070fSVincenzo MaffioneNote that on
741668e070fSVincenzo Maffione.Xr epoll 7
742668e070fSVincenzo Maffioneand
743668e070fSVincenzo Maffione.Xr kqueue 2 ,
744e91d04f7SChristian Brueffer.Dv NETMAP_NO_TX_POLL
745e91d04f7SChristian Bruefferand
746e91d04f7SChristian Brueffer.Dv NETMAP_DO_RX_POLL
747fa7db06bSLuigi Rizzoonly have an effect when some event is posted for the file descriptor.
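.Pp
The blocking condition mentioned above (cur equal to tail) can be made concrete with a toy model of the ring indices.
This is a sketch under stated assumptions: struct toy_ring and the toy_ring_*() helpers are hypothetical names that only model the index arithmetic, and they are not the definitions found in the netmap headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the ring indices; not the real struct netmap_ring. */
struct toy_ring {
    uint32_t num_slots; /* total number of slots in the ring */
    uint32_t cur;       /* next slot the application wants to use */
    uint32_t tail;      /* first slot not available to the application */
};

/* Index following i, wrapping around at the end of the ring. */
static uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in [cur, tail) that the application may use. */
static uint32_t
toy_ring_space(const struct toy_ring *r)
{
    return (r->tail + r->num_slots - r->cur) % r->num_slots;
}

/* Nonzero when no slot is available, i.e. cur has caught up with tail. */
static int
toy_ring_empty(const struct toy_ring *r)
{
    return r->cur == r->tail;
}
```

With 8 slots, cur at 6 and tail at 2, the application can still use slots 6, 7, 0 and 1; once cur reaches 2 the model reports an empty ring, which is the state in which the text above says select() and poll() block.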
74817885a7bSLuigi Rizzo.Sh LIBRARIES
74917885a7bSLuigi RizzoThe
75017885a7bSLuigi Rizzo.Nm
75117885a7bSLuigi RizzoAPI is supposed to be used directly, both because of its simplicity and
75217885a7bSLuigi Rizzofor efficient integration with applications.
75317885a7bSLuigi Rizzo.Pp
754e91d04f7SChristian BruefferFor convenience, the
755e91d04f7SChristian Brueffer.In net/netmap_user.h
75617885a7bSLuigi Rizzoheader provides a few macros and functions to ease creating
75717885a7bSLuigi Rizzoa file descriptor and doing I/O with a
75817885a7bSLuigi Rizzo.Nm
759e91d04f7SChristian Bruefferport.
760e91d04f7SChristian BruefferThese are loosely modeled after the
76117885a7bSLuigi Rizzo.Xr pcap 3
76217885a7bSLuigi RizzoAPI, to ease porting of libpcap-based applications to
76317885a7bSLuigi Rizzo.Nm .
76417885a7bSLuigi RizzoTo use these extra functions, programs should
76517885a7bSLuigi Rizzo.Dl #define NETMAP_WITH_LIBS
76617885a7bSLuigi Rizzobefore
76717885a7bSLuigi Rizzo.Dl #include <net/netmap_user.h>
76817885a7bSLuigi Rizzo.Pp
76917885a7bSLuigi RizzoThe following functions are available:
77017885a7bSLuigi Rizzo.Bl -tag -width XXXXX
771fa7db06bSLuigi Rizzo.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
77217885a7bSLuigi Rizzosimilar to
773668e070fSVincenzo Maffione.Xr pcap_open_live 3 ,
77417885a7bSLuigi Rizzobinds a file descriptor to a port.
77517885a7bSLuigi Rizzo.Bl -tag -width XX
77617885a7bSLuigi Rizzo.It Va ifname
77737e3a6d3SLuigi Rizzois a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
77817885a7bSLuigi Rizzo.Nm VALE
77917885a7bSLuigi Rizzoport.
780fa7db06bSLuigi Rizzo.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
78917885a7bSLuigi Rizzo.It Va flags
790fa7db06bSLuigi Rizzocan be set to a combination of the following flags:
791fa7db06bSLuigi Rizzo.Va NETMAP_NO_TX_POLL ,
792fa7db06bSLuigi Rizzo.Va NETMAP_DO_RX_POLL
793fa7db06bSLuigi Rizzo(copied into nr_ringid);
794668e070fSVincenzo Maffione.Va NM_OPEN_NO_MMAP
795668e070fSVincenzo Maffione(if arg points to the same memory region,
796fa7db06bSLuigi Rizzoavoids the mmap and uses the values from it);
797668e070fSVincenzo Maffione.Va NM_OPEN_IFNAME
798668e070fSVincenzo Maffione(ignores ifname and uses the values in arg);
799fa7db06bSLuigi Rizzo.Va NM_OPEN_ARG1 ,
800fa7db06bSLuigi Rizzo.Va NM_OPEN_ARG2 ,
801668e070fSVincenzo Maffione.Va NM_OPEN_ARG3
802668e070fSVincenzo Maffione(uses the fields from arg);
803668e070fSVincenzo Maffione.Va NM_OPEN_RING_CFG
804668e070fSVincenzo Maffione(uses the ring number and sizes from arg).
80517885a7bSLuigi Rizzo.El
806fa7db06bSLuigi Rizzo.It Va int nm_close(struct nm_desc *d )
80717885a7bSLuigi Rizzocloses the file descriptor, unmaps memory, frees resources.
808fa7db06bSLuigi Rizzo.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
similar to
.Va pcap_inject() ,
pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
813fa7db06bSLuigi Rizzo.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
814668e070fSVincenzo Maffionesimilar to
815668e070fSVincenzo Maffione.Va pcap_dispatch() ,
816668e070fSVincenzo Maffioneapplies a callback to incoming packets
817fa7db06bSLuigi Rizzo.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
818668e070fSVincenzo Maffionesimilar to
819668e070fSVincenzo Maffione.Va pcap_next() ,
820668e070fSVincenzo Maffionefetches the next packet
82117885a7bSLuigi Rizzo.El
82217885a7bSLuigi Rizzo.Sh SUPPORTED DEVICES
82317885a7bSLuigi Rizzo.Nm
82417885a7bSLuigi Rizzonatively supports the following devices:
82517885a7bSLuigi Rizzo.Pp
826668e070fSVincenzo MaffioneOn
827668e070fSVincenzo Maffione.Fx :
82837e3a6d3SLuigi Rizzo.Xr cxgbe 4 ,
82917885a7bSLuigi Rizzo.Xr em 4 ,
830668e070fSVincenzo Maffione.Xr iflib 4
831*1c37b63fSGlen Barber.Pq providing Xr igb 4 and Xr em 4 ,
83217885a7bSLuigi Rizzo.Xr ixgbe 4 ,
83337e3a6d3SLuigi Rizzo.Xr ixl 4 ,
834668e070fSVincenzo Maffione.Xr re 4 ,
835668e070fSVincenzo Maffione.Xr vtnet 4 .
83617885a7bSLuigi Rizzo.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
83817885a7bSLuigi Rizzo.Pp
83917885a7bSLuigi RizzoNICs without native support can still be used in
84017885a7bSLuigi Rizzo.Nm
841e91d04f7SChristian Brueffermode through emulation.
842e91d04f7SChristian BruefferPerformance is inferior to native netmap
84337e3a6d3SLuigi Rizzomode but still significantly higher than various raw socket types
84437e3a6d3SLuigi Rizzo(bpf, PF_PACKET, etc.).
84537e3a6d3SLuigi RizzoNote that for slow devices (such as 1 Gbit/s and slower NICs,
846fbb9e715SLuigi Rizzoor several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
847fbb9e715SLuigi Rizzoemulated and native mode will likely have similar or same throughput.
8483f879a47SEnji Cooper.Pp
84937e3a6d3SLuigi RizzoWhen emulation is in use, packet sniffer programs such as tcpdump
8503f879a47SEnji Coopercould see received packets before they are diverted by netmap.
8513f879a47SEnji CooperThis behaviour is not intentional, being just an artifact of the implementation
8523f879a47SEnji Cooperof emulation.
85337e3a6d3SLuigi RizzoNote that in case the netmap application subsequently moves packets received
85437e3a6d3SLuigi Rizzofrom the emulated adapter onto the host RX ring, the sniffer will intercept
85537e3a6d3SLuigi Rizzothose packets again, since the packets are injected to the host stack as they
85637e3a6d3SLuigi Rizzowere received by the network interface.
85717885a7bSLuigi Rizzo.Pp
85817885a7bSLuigi RizzoEmulation is also available for devices with native netmap support,
85917885a7bSLuigi Rizzowhich can be used for testing or performance comparison.
86017885a7bSLuigi RizzoThe sysctl variable
86117885a7bSLuigi Rizzo.Va dev.netmap.admode
86217885a7bSLuigi Rizzoglobally controls how netmap mode is implemented.
86317885a7bSLuigi Rizzo.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on
.Fx
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap/parameters/* ) :
87117885a7bSLuigi Rizzo.Bl -tag -width indent
87217885a7bSLuigi Rizzo.It Va dev.netmap.admode: 0
87317885a7bSLuigi RizzoControls the use of native or emulated adapter mode.
8743f879a47SEnji Cooper.Pp
87537e3a6d3SLuigi Rizzo0 uses the best available option;
8763f879a47SEnji Cooper.Pp
87737e3a6d3SLuigi Rizzo1 forces native mode and fails if not available;
8783f879a47SEnji Cooper.Pp
2 forces emulated mode, hence never fails.
880668e070fSVincenzo Maffione.It Va dev.netmap.generic_rings: 1
881668e070fSVincenzo MaffioneNumber of rings used for emulated netmap mode
88217885a7bSLuigi Rizzo.It Va dev.netmap.generic_ringsize: 1024
88317885a7bSLuigi RizzoRing size used for emulated netmap mode
88417885a7bSLuigi Rizzo.It Va dev.netmap.generic_mit: 100000
88517885a7bSLuigi RizzoControls interrupt moderation for emulated mode
88617885a7bSLuigi Rizzo.It Va dev.netmap.mmap_unreg: 0
88717885a7bSLuigi Rizzo.It Va dev.netmap.fwd: 0
88817885a7bSLuigi RizzoForces NS_FORWARD mode
88917885a7bSLuigi Rizzo.It Va dev.netmap.flags: 0
89017885a7bSLuigi Rizzo.It Va dev.netmap.txsync_retry: 2
89117885a7bSLuigi Rizzo.It Va dev.netmap.no_pendintr: 1
89217885a7bSLuigi RizzoForces recovery of transmit buffers on system calls
89317885a7bSLuigi Rizzo.It Va dev.netmap.mitigate: 1
89417885a7bSLuigi RizzoPropagates interrupt mitigation to user processes
89517885a7bSLuigi Rizzo.It Va dev.netmap.no_timestamp: 0
89617885a7bSLuigi RizzoDisables the update of the timestamp in the netmap ring
89717885a7bSLuigi Rizzo.It Va dev.netmap.verbose: 0
89817885a7bSLuigi RizzoVerbose kernel messages
89917885a7bSLuigi Rizzo.It Va dev.netmap.buf_num: 163840
90017885a7bSLuigi Rizzo.It Va dev.netmap.buf_size: 2048
90117885a7bSLuigi Rizzo.It Va dev.netmap.ring_num: 200
90217885a7bSLuigi Rizzo.It Va dev.netmap.ring_size: 36864
90317885a7bSLuigi Rizzo.It Va dev.netmap.if_num: 100
90417885a7bSLuigi Rizzo.It Va dev.netmap.if_size: 1024
90517885a7bSLuigi RizzoSizes and number of objects (netmap_if, netmap_ring, buffers)
906e91d04f7SChristian Bruefferfor the global memory region.
907e91d04f7SChristian BruefferThe only parameter worth modifying is
90817885a7bSLuigi Rizzo.Va dev.netmap.buf_num
90917885a7bSLuigi Rizzoas it impacts the total amount of memory used by netmap.
91017885a7bSLuigi Rizzo.It Va dev.netmap.buf_curr_num: 0
91117885a7bSLuigi Rizzo.It Va dev.netmap.buf_curr_size: 0
91217885a7bSLuigi Rizzo.It Va dev.netmap.ring_curr_num: 0
91317885a7bSLuigi Rizzo.It Va dev.netmap.ring_curr_size: 0
91417885a7bSLuigi Rizzo.It Va dev.netmap.if_curr_num: 0
91517885a7bSLuigi Rizzo.It Va dev.netmap.if_curr_size: 0
91617885a7bSLuigi RizzoActual values in use.
91717885a7bSLuigi Rizzo.It Va dev.netmap.bridge_batch: 1024
91817885a7bSLuigi RizzoBatch size used when moving packets across a
91917885a7bSLuigi Rizzo.Nm VALE
920e91d04f7SChristian Bruefferswitch.
921e91d04f7SChristian BruefferValues above 64 generally guarantee good
92217885a7bSLuigi Rizzoperformance.
923668e070fSVincenzo Maffione.It Va dev.netmap.ptnet_vnet_hdr: 1
924668e070fSVincenzo MaffioneAllow ptnet devices to use virtio-net headers
92517885a7bSLuigi Rizzo.El
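.Pp
As a rough illustration of why dev.netmap.buf_num dominates memory use, the size of the buffer pool is simply the number of buffers times the size of each one.
The helper below is a hypothetical name used only for this arithmetic; it ignores the (much smaller) space taken by rings and netmap_if objects, and any alignment or rounding the kernel may apply.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes taken by the packet buffers: buffer count times buffer size. */
static uint64_t
buffer_pool_bytes(uint64_t buf_num, uint64_t buf_size)
{
    return buf_num * buf_size;
}
```

With the default values listed above (163840 buffers of 2048 bytes each) this gives 335544320 bytes, that is 320 MiB.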
92613a5d88fSLuigi Rizzo.Sh SYSTEM CALLS
92768b8534bSLuigi Rizzo.Nm
92868b8534bSLuigi Rizzouses
929fa7db06bSLuigi Rizzo.Xr select 2 ,
930fa7db06bSLuigi Rizzo.Xr poll 2 ,
931668e070fSVincenzo Maffione.Xr epoll 7
93268b8534bSLuigi Rizzoand
93337e3a6d3SLuigi Rizzo.Xr kqueue 2
934ce3ee1e7SLuigi Rizzoto wake up processes when significant events occur, and
935ce3ee1e7SLuigi Rizzo.Xr mmap 2
936ce3ee1e7SLuigi Rizzoto map memory.
93717885a7bSLuigi Rizzo.Xr ioctl 2
93817885a7bSLuigi Rizzois used to configure ports and
93917885a7bSLuigi Rizzo.Nm VALE switches .
940ce3ee1e7SLuigi Rizzo.Pp
941ce3ee1e7SLuigi RizzoApplications may need to create threads and bind them to
942ce3ee1e7SLuigi Rizzospecific cores to improve performance, using standard
943ce3ee1e7SLuigi RizzoOS primitives, see
944ce3ee1e7SLuigi Rizzo.Xr pthread 3 .
945ce3ee1e7SLuigi RizzoIn particular,
946ce3ee1e7SLuigi Rizzo.Xr pthread_setaffinity_np 3
947ce3ee1e7SLuigi Rizzomay be of use.
94868b8534bSLuigi Rizzo.Sh EXAMPLES
94917885a7bSLuigi Rizzo.Ss TEST PROGRAMS
95017885a7bSLuigi Rizzo.Nm
95117885a7bSLuigi Rizzocomes with a few programs that can be used for testing or
95217885a7bSLuigi Rizzosimple applications.
95317885a7bSLuigi RizzoSee the
954e91d04f7SChristian Brueffer.Pa examples/
95517885a7bSLuigi Rizzodirectory in
95617885a7bSLuigi Rizzo.Nm
95717885a7bSLuigi Rizzodistributions, or
958e91d04f7SChristian Brueffer.Pa tools/tools/netmap/
959e91d04f7SChristian Bruefferdirectory in
960e91d04f7SChristian Brueffer.Fx
961e91d04f7SChristian Bruefferdistributions.
96217885a7bSLuigi Rizzo.Pp
9633f879a47SEnji Cooper.Xr pkt-gen 8
96417885a7bSLuigi Rizzois a general purpose traffic source/sink.
96517885a7bSLuigi Rizzo.Pp
96617885a7bSLuigi RizzoAs an example
96717885a7bSLuigi Rizzo.Dl pkt-gen -i ix0 -f tx -l 60
96817885a7bSLuigi Rizzocan generate an infinite stream of minimum size packets, and
96917885a7bSLuigi Rizzo.Dl pkt-gen -i ix0 -f rx
97017885a7bSLuigi Rizzois a traffic sink.
97117885a7bSLuigi RizzoBoth print traffic statistics, to help monitor
97217885a7bSLuigi Rizzohow the system performs.
97317885a7bSLuigi Rizzo.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
and rates, and to use multiple send/receive threads and cores.
97717885a7bSLuigi Rizzo.Pp
.Xr bridge 8
97917885a7bSLuigi Rizzois another test program which interconnects two
98017885a7bSLuigi Rizzo.Nm
981e91d04f7SChristian Bruefferports.
982e91d04f7SChristian BruefferIt can be used for transparent forwarding between
98317885a7bSLuigi Rizzointerfaces, as in
984e1344f3cSVincenzo Maffione.Dl bridge -i netmap:ix0 -i netmap:ix1
98517885a7bSLuigi Rizzoor even connect the NIC to the host stack using netmap
986e1344f3cSVincenzo Maffione.Dl bridge -i netmap:ix0
98717885a7bSLuigi Rizzo.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    /* error checking omitted for brevity */
    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
102217885a7bSLuigi Rizzo.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
104617885a7bSLuigi Rizzo.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
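.Pp
The buffer swap above can be exercised in isolation with a self-contained model.
This is a sketch only: struct toy_slot, TOY_BUF_CHANGED and toy_swap_slots() are hypothetical stand-ins for struct netmap_slot, NS_BUF_CHANGED and the inline code above, and the flag value is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BUF_CHANGED 0x0001  /* hypothetical stand-in for NS_BUF_CHANGED */

/* Hypothetical model of a slot: a buffer index, a length and flags. */
struct toy_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/*
 * Move the packet from the receive slot to the transmit slot by
 * exchanging buffer indices; no payload bytes are copied.
 */
static void
toy_swap_slots(struct toy_slot *src, struct toy_slot *dst)
{
    uint32_t tmp = dst->buf_idx;

    dst->buf_idx = src->buf_idx;   /* tx slot now points at the packet */
    dst->len = src->len;
    dst->flags |= TOY_BUF_CHANGED;
    src->buf_idx = tmp;            /* rx slot gets the spare buffer */
    src->flags |= TOY_BUF_CHANGED;
}
```

Unlike the fragment above, the model sets the flag with a bitwise OR so that other flag bits in the slot are preserved; whether that matters depends on which flags the application uses.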
106817885a7bSLuigi Rizzo.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
107617885a7bSLuigi Rizzo.Ss VALE SWITCH
107717885a7bSLuigi RizzoA simple way to test the performance of a
107817885a7bSLuigi Rizzo.Nm VALE
107917885a7bSLuigi Rizzoswitch is to attach a sender and a receiver to it,
10803f879a47SEnji Coopere.g., running the following in two different terminals:
108117885a7bSLuigi Rizzo.Dl pkt-gen -i vale1:a -f rx # receiver
108217885a7bSLuigi Rizzo.Dl pkt-gen -i vale1:b -f tx # sender
1083fa7db06bSLuigi RizzoThe same example can be used to test netmap pipes, by simply
10843f879a47SEnji Cooperchanging port names, e.g.,
108537e3a6d3SLuigi Rizzo.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
108637e3a6d3SLuigi Rizzo.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
108717885a7bSLuigi Rizzo.Pp
108817885a7bSLuigi RizzoThe following command attaches an interface and the host stack
108917885a7bSLuigi Rizzoto a switch:
1090c7c78055SVincenzo Maffione.Dl valectl -h vale2:em0
109117885a7bSLuigi RizzoOther
109268b8534bSLuigi Rizzo.Nm
109317885a7bSLuigi Rizzoclients attached to the same switch can now communicate
109417885a7bSLuigi Rizzowith the network card or the host.
109513a5d88fSLuigi Rizzo.Sh SEE ALSO
1096689f146bSVincenzo Maffione.Xr vale 4 ,
1097c7c78055SVincenzo Maffione.Xr valectl 8 ,
1098689f146bSVincenzo Maffione.Xr bridge 8 ,
1099689f146bSVincenzo Maffione.Xr lb 8 ,
1100668e070fSVincenzo Maffione.Xr nmreplay 8 ,
1101689f146bSVincenzo Maffione.Xr pkt-gen 8
11021a7d3c05SVincenzo Maffione.Pp
11030b3504fdSChristian Brueffer.Pa http://info.iet.unipi.it/~luigi/netmap/
110413a5d88fSLuigi Rizzo.Pp
110513a5d88fSLuigi RizzoLuigi Rizzo, Revisiting network I/O APIs: the netmap framework,
110613a5d88fSLuigi RizzoCommunications of the ACM, 55 (3), pp.45-51, March 2012
110713a5d88fSLuigi Rizzo.Pp
110813a5d88fSLuigi RizzoLuigi Rizzo, netmap: a novel framework for fast packet I/O,
110913a5d88fSLuigi RizzoUsenix ATC'12, June 2012, Boston
1110fa7db06bSLuigi Rizzo.Pp
1111fa7db06bSLuigi RizzoLuigi Rizzo, Giuseppe Lettieri,
1112fa7db06bSLuigi RizzoVALE, a switched ethernet for virtual machines,
1113fa7db06bSLuigi RizzoACM CoNEXT'12, December 2012, Nice
1114fa7db06bSLuigi Rizzo.Pp
1115fa7db06bSLuigi RizzoLuigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
1116fa7db06bSLuigi RizzoSpeeding up packet I/O in virtual machines,
1117fa7db06bSLuigi RizzoACM/IEEE ANCS'13, October 2013, San Jose
111868b8534bSLuigi Rizzo.Sh AUTHORS
111913a5d88fSLuigi Rizzo.An -nosplit
112068b8534bSLuigi RizzoThe
112168b8534bSLuigi Rizzo.Nm
framework was originally designed and implemented at the
112313a5d88fSLuigi RizzoUniversita` di Pisa in 2011 by
112413a5d88fSLuigi Rizzo.An Luigi Rizzo ,
1125ce3ee1e7SLuigi Rizzoand further extended with help from
112613a5d88fSLuigi Rizzo.An Matteo Landi ,
112713a5d88fSLuigi Rizzo.An Gaetano Catalli ,
1128ce3ee1e7SLuigi Rizzo.An Giuseppe Lettieri ,
1129e91d04f7SChristian Bruefferand
1130ce3ee1e7SLuigi Rizzo.An Vincenzo Maffione .
113113a5d88fSLuigi Rizzo.Pp
113213a5d88fSLuigi Rizzo.Nm
1133ce3ee1e7SLuigi Rizzoand
1134ce3ee1e7SLuigi Rizzo.Nm VALE
1135ce3ee1e7SLuigi Rizzohave been funded by the European Commission within FP7 Projects
1136ce3ee1e7SLuigi RizzoCHANGE (257422) and OPENLAB (287581).
1137bf15fc88SJoel Dahl.Sh CAVEATS
1138bf15fc88SJoel DahlNo matter how fast the CPU and OS are,
1139bf15fc88SJoel Dahlachieving line rate on 10G and faster interfaces
1140bf15fc88SJoel Dahlrequires hardware with sufficient performance.
1141bf15fc88SJoel DahlSeveral NICs are unable to sustain line rate with
1142e91d04f7SChristian Brueffersmall packet sizes.
1143e91d04f7SChristian BruefferInsufficient PCIe or memory bandwidth
1144bf15fc88SJoel Dahlcan also cause reduced performance.
1145bf15fc88SJoel Dahl.Pp
1146bf15fc88SJoel DahlAnother frequent reason for low performance is the use
1147bf15fc88SJoel Dahlof flow control on the link: a slow receiver can limit
1148bf15fc88SJoel Dahlthe transmit speed.
1149bf15fc88SJoel DahlBe sure to disable flow control when running high
1150bf15fc88SJoel Dahlspeed experiments.
1151bf15fc88SJoel Dahl.Ss SPECIAL NIC FEATURES
1152bf15fc88SJoel Dahl.Nm
1153bf15fc88SJoel Dahlis orthogonal to some NIC features such as
1154bf15fc88SJoel Dahlmultiqueue, schedulers, packet filters.
1155bf15fc88SJoel Dahl.Pp
1156bf15fc88SJoel DahlMultiple transmit and receive rings are supported natively
1157bf15fc88SJoel Dahland can be configured with ordinary OS tools,
1158bf15fc88SJoel Dahlsuch as
11593f879a47SEnji Cooper.Xr ethtool 8
1160bf15fc88SJoel Dahlor
1161bf15fc88SJoel Dahldevice-specific sysctl variables.
1162bf15fc88SJoel DahlThe same goes for Receive Packet Steering (RPS)
1163bf15fc88SJoel Dahland filtering of incoming traffic.
1164bf15fc88SJoel Dahl.Pp
1165bf15fc88SJoel Dahl.Nm
1166bf15fc88SJoel Dahl.Em does not use
1167bf15fc88SJoel Dahlfeatures such as
1168bf15fc88SJoel Dahl.Em checksum offloading , TCP segmentation offloading ,
1169bf15fc88SJoel Dahl.Em encryption , VLAN encapsulation/decapsulation ,
1170e91d04f7SChristian Bruefferetc.
1171bf15fc88SJoel DahlWhen using netmap to exchange packets with the host stack,
1172bf15fc88SJoel Dahlmake sure to disable these features.
1173