.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.Dd October 10, 2024
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr kqueue 2
and
.Xr epoll 7 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
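.Pp
The naming rule for VALE ports can be checked in userspace before calling NIOCREGIF.
The following is a minimal sketch only, not part of the netmap API:
the helper names and DEMO_IFNAMSIZ are hypothetical, the value 16 for IFNAMSIZ is an assumption,
and the sketch assumes the name must fit in nr_name together with its terminating NUL.
It does not check PPP against existing OS interface names.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for IFNAMSIZ from <net/if.h>; the value 16 is assumed here. */
#define DEMO_IFNAMSIZ 16

/* True for characters in the class [0-9a-zA-Z_]. */
static bool
demo_is_name_char(char c)
{
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
	    (c >= 'A' && c <= 'Z') || c == '_';
}

/* True if s[0..n) is non-empty and made only of name characters. */
static bool
demo_token_ok(const char *s, size_t n)
{
	if (n == 0)
		return false;
	for (size_t i = 0; i < n; i++)
		if (!demo_is_name_char(s[i]))
			return false;
	return true;
}

/* Check that name has the shape "valeSSS:PPP" described above. */
static bool
demo_vale_name_ok(const char *name)
{
	const char *sw, *colon;

	if (strlen(name) >= DEMO_IFNAMSIZ)	/* must fit with its NUL */
		return false;
	if (strncmp(name, "vale", 4) != 0)
		return false;
	sw = name + 4;				/* start of SSS */
	colon = strchr(sw, ':');
	if (colon == NULL)
		return false;
	return demo_token_ok(sw, (size_t)(colon - sw)) &&
	    demo_token_ok(colon + 1, strlen(colon + 1));
}
```

For instance, demo_vale_name_ok("vale0:p1") accepts the name, while "vale0" (no port part) is rejected.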
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls, while
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface )
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
The number of extra
buffers is specified in the
.Va arg.nr_arg3
field.
On success, the kernel writes back to
.Va arg.nr_arg3
the number of extra buffers actually allocated (it may be less
than the amount requested if the memory space ran out of buffers).
.Pa ni_bufs_head
contains the index of the first of these extra buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
The application is free to modify
this list and use the buffers (e.g., by binding them to the slots of a
netmap ring).
When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
.Em NIOCREGIF .
.It Dv struct netmap_ring (one per ring )
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer )
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
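.Pp
The extra-buffer list rooted at ni_bufs_head can be walked as sketched below.
This is an illustration under stated assumptions, not code from the netmap headers:
buf_base and buf_size are hypothetical stand-ins for what a real program would obtain
through NETMAP_BUF() and nr_buf_size, and buffers are assumed to be laid out contiguously by index.

```c
#include <stdint.h>
#include <string.h>

/*
 * Count the buffers on a list that starts at buffer index `head`.
 * The first uint32_t of each buffer holds the index of the next one,
 * and index 0 terminates the list.
 */
static uint32_t
demo_count_extra_bufs(const char *buf_base, uint32_t buf_size, uint32_t head)
{
	uint32_t n = 0;

	while (head != 0) {
		uint32_t next;

		/* memcpy avoids alignment and aliasing assumptions. */
		memcpy(&next, buf_base + (size_t)head * buf_size, sizeof(next));
		head = next;
		n++;
	}
	return n;
}
```

An application that removes buffers from the list would update the stored indexes in the same way before writing the new head back.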
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
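.Pp
The index arithmetic described above can be sketched as follows.
The structure and function names are hypothetical stand-ins (the real fields live in struct netmap_ring,
and the real helper is nm_ring_next());
the point of the sketch is that wrap-around uses a comparison rather than a bit mask,
because num_slots need not be a power of two.

```c
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct demo_ring {
	uint32_t num_slots;
	uint32_t head, cur, tail;
};

/* Next index after i, wrapping to 0 at the end of the ring. */
static uint32_t
demo_ring_next(const struct demo_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range head ... tail-1 (owned by userspace). */
static uint32_t
demo_ring_user_slots(const struct demo_ring *r)
{
	return (r->tail >= r->head) ? r->tail - r->head :
	    r->tail + r->num_slots - r->head;
}
```

With num_slots = 7, head = 5 and tail = 2, the user-owned slots are 5, 6, 0 and 1.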
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
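.Pp
The "user creates new packets" step of the diagram can be sketched in plain C.
This is a self-contained model, not netmap code: struct demo_txring, its embedded buffer array
and demo_tx_enqueue() are hypothetical, and a real program would reach buffers through NETMAP_BUF()
and then issue NIOCTXSYNC or poll() to push the slots out.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_SLOTS	8
#define DEMO_BUFSIZE	64

struct demo_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/* Hypothetical model of a transmit ring and its packet buffers. */
struct demo_txring {
	uint32_t num_slots, head, cur, tail;
	struct demo_slot slot[DEMO_SLOTS];
	char buf[DEMO_SLOTS][DEMO_BUFSIZE];
};

/* Queue up to `count` copies of pkt; return how many were queued. */
static unsigned
demo_tx_enqueue(struct demo_txring *r, const void *pkt, uint16_t len,
    unsigned count)
{
	unsigned sent = 0;
	uint32_t i = r->head;

	if (len > DEMO_BUFSIZE)
		return 0;
	/* Only slots in head ... tail-1 may be filled. */
	while (sent < count && i != r->tail) {
		struct demo_slot *s = &r->slot[i];

		memcpy(r->buf[s->buf_idx], pkt, len);
		s->len = len;
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
		sent++;
	}
	/* Advance head and cur past the slots that are ready to go. */
	r->head = r->cur = i;
	return sent;
}
```

Filling as many slots as are available before the next system call is what amortizes the system call cost.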
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
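.Pp
The pending test stated above can be written out as below.
This is a sketch that mirrors the condition in the text with a hypothetical structure;
it is not the library's own definition of nm_tx_pending().

```c
#include <stdint.h>

/* Hypothetical stand-in for the index fields of a transmit ring. */
struct demo_pring {
	uint32_t num_slots;
	uint32_t head, tail;
};

/* Non-zero when head differs from tail + 1, modulo the ring size. */
static int
demo_tx_pending(const struct demo_pring *r)
{
	uint32_t after_tail = (r->tail + 1 == r->num_slots) ? 0 : r->tail + 1;

	return after_tail != r->head;
}
```

A program that wants to close the file descriptor only after everything has been sent could keep issuing NIOCTXSYNC until such a test reports nothing pending.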
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
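.Pp
The simplest receive pattern, in which every received slot is processed and returned at once, can be sketched as follows.
The structures and demo_rx_drain() are hypothetical models rather than netmap definitions;
a real program would inspect the packet through NETMAP_BUF() where this sketch only adds up lengths.

```c
#include <stdint.h>

#define DEMO_RX_SLOTS	8

struct demo_rx_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/* Hypothetical model of a receive ring. */
struct demo_rxring {
	uint32_t num_slots, head, cur, tail;
	struct demo_rx_slot slot[DEMO_RX_SLOTS];
};

/* Process all slots in head ... tail-1; return the total byte count. */
static uint64_t
demo_rx_drain(struct demo_rxring *r, unsigned *npkts)
{
	uint64_t bytes = 0;
	uint32_t i = r->head;

	*npkts = 0;
	while (i != r->tail) {
		bytes += r->slot[i].len;
		(*npkts)++;
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
	}
	/* Return every processed slot to the kernel at the next sync. */
	r->head = r->cur = i;
	return bytes;
}
```

To hold some slots, as in the diagram, a program would instead stop advancing head early while still moving cur forward.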
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode,
packets marked with this flag by the user application are forwarded to the
other endpoint at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
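.Pp
The role of NS_BUF_CHANGED in zero-copy forwarding can be illustrated by swapping the buffers of a receive slot and a transmit slot.
The sketch below uses a hypothetical slot structure and a local DEMO_NS_BUF_CHANGED constant whose value is assumed;
real code should use the definitions from the netmap headers, and the swap is only meaningful when both rings share the same memory region.

```c
#include <stdint.h>

/* Local stand-in for NS_BUF_CHANGED; the value is an assumption. */
#define DEMO_NS_BUF_CHANGED	0x0001

struct demo_zc_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/* Move the packet in rx to tx by exchanging buffer indexes. */
static void
demo_zc_forward(struct demo_zc_slot *rx, struct demo_zc_slot *tx)
{
	uint32_t spare = tx->buf_idx;

	tx->buf_idx = rx->buf_idx;	/* tx now owns the received data */
	tx->len = rx->len;
	rx->buf_idx = spare;		/* rx gets the spare buffer back */
	/* Both slots changed buf_idx, so both must carry the flag. */
	tx->flags |= DEMO_NS_BUF_CHANGED;
	rx->flags |= DEMO_NS_BUF_CHANGED;
}
```

No packet data is copied: only the two indexes and the length move between the slots.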
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; no field holds the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
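.Pp
Since no field holds the total packet length, a receiver has to add up the fragments itself.
The following sketch shows one way to do so on a hypothetical slot array;
DEMO_NS_MOREFRAG is a local stand-in whose value is assumed, and the 64-buffer limit is taken from the text above.

```c
#include <stdint.h>

/* Local stand-in for NS_MOREFRAG; the value is an assumption. */
#define DEMO_NS_MOREFRAG	0x0020
#define DEMO_MAX_FRAGS		64

struct demo_sg_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/*
 * Walk the packet that starts at slot `first`.
 * Returns the number of slots it spans and stores the summed length
 * in *total, or returns 0 if the chain reaches `tail` before its last
 * fragment or exceeds DEMO_MAX_FRAGS slots.
 */
static unsigned
demo_sg_packet_len(const struct demo_sg_slot *slots, uint32_t num_slots,
    uint32_t first, uint32_t tail, uint32_t *total)
{
	uint32_t i = first;
	unsigned n = 0;

	*total = 0;
	while (i != tail && n < DEMO_MAX_FRAGS) {
		*total += slots[i].len;
		n++;
		if ((slots[i].flags & DEMO_NS_MOREFRAG) == 0)
			return n;	/* last fragment reached */
		i = (i + 1 == num_slots) ? 0 : i + 1;
	}
	return 0;
}
```

A return value of 0 tells the caller to wait for more slots before consuming the packet.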
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use the
.Fn nm_open
function (see
.Sx LIBRARIES ) ,
which parses names to access specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as
.Dv NIOCGINFO ,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
The
.Va nr_flags
and
.Va nr_ringid
fields select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Fn nm_open
function described below) can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs;
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack;
.It NR_REG_NIC_SW "netmap:foo*"
all hardware rings and the host rings;
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought of as part of the pipe name,
and does not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_POLL
to the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Fn ioctl NIOCTXSYNC
is issued, or when
.Fn select
or
.Fn poll
is called with a write event (POLLOUT/wfdset) or with a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Nm VALE
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
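.Pp
The port-name suffixes listed under
.Dv NIOCREGIF
can be illustrated with a small, self-contained parser.
This is only a sketch under stated assumptions: the enum, its values and the function
.Fn parse_port_suffix
are hypothetical names introduced here, not the real
.Dv NR_REG_*
constants, and the parser inside
.Fn nm_open
may accept more forms than this one.
In particular, the sketch treats the first suffix character it finds as the
marker, so an interface name that itself contains a dash is not handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the NR_REG_* modes; values are illustrative. */
enum ring_mode {
    MODE_ALL_NIC,       /* "netmap:foo"   all hardware ring pairs */
    MODE_SW,            /* "netmap:foo^"  host rings only         */
    MODE_NIC_SW,        /* "netmap:foo*"  hardware and host rings */
    MODE_ONE_NIC,       /* "netmap:foo-i" i-th hardware ring pair */
    MODE_PIPE_MASTER,   /* "netmap:foo{i" master side of pipe i   */
    MODE_PIPE_SLAVE     /* "netmap:foo}i" slave side of pipe i    */
};

/*
 * Classify a port name by its suffix. On success returns 0 and stores the
 * mode in *mode and the ring or pipe identifier (0 if unused) in *id.
 * Returns -1 if the suffix is malformed.
 */
static int
parse_port_suffix(const char *name, enum ring_mode *mode, unsigned *id)
{
    const char *p = name + strcspn(name, "^*-{}");
    char *end;
    unsigned long v;

    *id = 0;
    switch (*p) {
    case '\0':
        *mode = MODE_ALL_NIC;
        return (0);
    case '^':
        *mode = MODE_SW;
        return (p[1] == '\0' ? 0 : -1);
    case '*':
        *mode = MODE_NIC_SW;
        return (p[1] == '\0' ? 0 : -1);
    case '-':
        *mode = MODE_ONE_NIC;
        break;
    case '{':
        *mode = MODE_PIPE_MASTER;
        break;
    default:            /* '}' is the only remaining possibility */
        *mode = MODE_PIPE_SLAVE;
        break;
    }
    /* The remaining modes need a decimal identifier after the marker. */
    if (p[1] < '0' || p[1] > '9')
        return (-1);
    v = strtoul(p + 1, &end, 10);
    if (*end != '\0')
        return (-1);
    *id = (unsigned)v;
    return (0);
}
```

The identifier returned in
.Va id
corresponds to the value that the text above says is carried in
.Pa nr_ringid .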
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Pq Va ring->cur == ring->tail .
Depending on the platform,
.Xr epoll 7
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Dv NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Dv NIOCREGIF
updates receive rings even without read events.
Note that with
.Xr epoll 7
and
.Xr kqueue 2 ,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
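.Pp
The rules above for when a
.Xr poll 2
call processes rings can be summarized as two predicates.
The sketch below is a model of the documented behaviour, not kernel code:
the flag bits
.Dv MODEL_NO_TX_POLL
and
.Dv MODEL_DO_RX_POLL
and both function names are hypothetical, and the additional
full-ring case mentioned under
.Dv NIOCREGIF
is not modeled.

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; not the real NETMAP_* values. */
#define MODEL_NO_TX_POLL 0x1u
#define MODEL_DO_RX_POLL 0x2u

/* TX rings are flushed on POLLOUT, or on any poll unless NO_TX_POLL is set. */
static bool
tx_rings_processed(short events, unsigned flags)
{
    return ((events & POLLOUT) != 0 || (flags & MODEL_NO_TX_POLL) == 0);
}

/* RX rings are updated on POLLIN, or on any poll if DO_RX_POLL is set. */
static bool
rx_rings_processed(short events, unsigned flags)
{
    return ((events & POLLIN) != 0 || (flags & MODEL_DO_RX_POLL) != 0);
}
```

With no flags set, a receive-only poll still flushes the transmit rings,
whereas a transmit-only poll leaves the receive rings untouched.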
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
76717885a7bSLuigi RizzoThe following functions are available:
76817885a7bSLuigi Rizzo.Bl -tag -width XXXXX
769fa7db06bSLuigi Rizzo.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
77017885a7bSLuigi Rizzosimilar to
771668e070fSVincenzo Maffione.Xr pcap_open_live 3 ,
77217885a7bSLuigi Rizzobinds a file descriptor to a port.
77317885a7bSLuigi Rizzo.Bl -tag -width XX
77417885a7bSLuigi Rizzo.It Va ifname
77537e3a6d3SLuigi Rizzois a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
77617885a7bSLuigi Rizzo.Nm VALE
77717885a7bSLuigi Rizzoport.
778fa7db06bSLuigi Rizzo.It Va req
779fa7db06bSLuigi Rizzoprovides the initial values for the argument to the NIOCREGIF ioctl.
780fa7db06bSLuigi RizzoThe nm_flags and nm_ringid values are overwritten by parsing
781fa7db06bSLuigi Rizzoifname and flags, and other fields can be overridden through
782fa7db06bSLuigi Rizzothe other two arguments.
783fa7db06bSLuigi Rizzo.It Va arg
7843f879a47SEnji Cooperpoints to a struct nm_desc containing arguments (e.g., from a previously
785fa7db06bSLuigi Rizzoopen file descriptor) that should override the defaults.
786fa7db06bSLuigi RizzoThe fields are used as described below
78717885a7bSLuigi Rizzo.It Va flags
788fa7db06bSLuigi Rizzocan be set to a combination of the following flags:
789fa7db06bSLuigi Rizzo.Va NETMAP_NO_TX_POLL ,
790fa7db06bSLuigi Rizzo.Va NETMAP_DO_RX_POLL
791fa7db06bSLuigi Rizzo(copied into nr_ringid);
792668e070fSVincenzo Maffione.Va NM_OPEN_NO_MMAP
793668e070fSVincenzo Maffione(if arg points to the same memory region,
794fa7db06bSLuigi Rizzoavoids the mmap and uses the values from it);
795668e070fSVincenzo Maffione.Va NM_OPEN_IFNAME
796668e070fSVincenzo Maffione(ignores ifname and uses the values in arg);
797fa7db06bSLuigi Rizzo.Va NM_OPEN_ARG1 ,
798fa7db06bSLuigi Rizzo.Va NM_OPEN_ARG2 ,
799668e070fSVincenzo Maffione.Va NM_OPEN_ARG3
800668e070fSVincenzo Maffione(uses the fields from arg);
801668e070fSVincenzo Maffione.Va NM_OPEN_RING_CFG
802668e070fSVincenzo Maffione(uses the ring number and sizes from arg).
80317885a7bSLuigi Rizzo.El
804fa7db06bSLuigi Rizzo.It Va int nm_close(struct nm_desc *d )
80517885a7bSLuigi Rizzocloses the file descriptor, unmaps memory, frees resources.
806fa7db06bSLuigi Rizzo.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
807668e070fSVincenzo Maffionesimilar to
808668e070fSVincenzo Maffione.Va pcap_inject() ,
809668e070fSVincenzo Maffionepushes a packet to a ring, returns the size
81017885a7bSLuigi Rizzoof the packet is successful, or 0 on error;
811fa7db06bSLuigi Rizzo.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
812668e070fSVincenzo Maffionesimilar to
813668e070fSVincenzo Maffione.Va pcap_dispatch() ,
814668e070fSVincenzo Maffioneapplies a callback to incoming packets
815fa7db06bSLuigi Rizzo.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
816668e070fSVincenzo Maffionesimilar to
817668e070fSVincenzo Maffione.Va pcap_next() ,
818668e070fSVincenzo Maffionefetches the next packet
81917885a7bSLuigi Rizzo.El
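.Pp
The two
.Va ifname
forms accepted by
.Fn nm_open
("netmap:PPP" and "valeSSS:PPP") can be told apart by their prefix.
The following is a minimal sketch of that distinction only; the enum
.Vt port_kind
and the function
.Fn classify_port
are hypothetical and are not part of
.In net/netmap_user.h ,
and the real library may recognize further name forms.

```c
#include <string.h>

/* Hypothetical classification used only in this sketch. */
enum port_kind { PORT_UNKNOWN, PORT_NIC, PORT_VALE };

/*
 * "netmap:PPP" names a NIC; "valeSSS:PPP" names port PPP on the VALE
 * switch valeSSS. Anything else is reported as unknown.
 */
static enum port_kind
classify_port(const char *ifname)
{
    const char *colon = strchr(ifname, ':');

    /* Both forms need a non-empty port part after the colon. */
    if (colon == NULL || colon[1] == '\0')
        return (PORT_UNKNOWN);
    if (strncmp(ifname, "netmap:", 7) == 0)
        return (PORT_NIC);
    if (strncmp(ifname, "vale", 4) == 0)
        return (PORT_VALE);
    return (PORT_UNKNOWN);
}
```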
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On
.Fx :
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr iflib 4
.Pq providing Xr igb 4 and Xr em 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr re 4 ,
.Xr vtnet 4 .
.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than that of various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
could see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that if the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected into the host stack as if
they were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
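.Pp
The three values of
.Va dev.netmap.admode
documented under
.Sx SYSCTL VARIABLES AND MODULE PARAMETERS
amount to a small decision table.
The sketch below models it in user space; the enum and the function
.Fn select_adapter_mode
are hypothetical, and reading "best available" as "native when supported,
emulated otherwise" is an assumption of this example.

```c
#include <stdbool.h>

/* Hypothetical result type for this sketch. */
enum adapter_mode { ADAPTER_FAIL, ADAPTER_NATIVE, ADAPTER_EMULATED };

/*
 * Model of the documented dev.netmap.admode policy:
 *   0: best available option, 1: native only, 2: always emulated.
 */
static enum adapter_mode
select_adapter_mode(int admode, bool native_supported)
{
    switch (admode) {
    case 0:
        return (native_supported ? ADAPTER_NATIVE : ADAPTER_EMULATED);
    case 1:
        return (native_supported ? ADAPTER_NATIVE : ADAPTER_FAIL);
    case 2:
        return (ADAPTER_EMULATED);
    default:
        /* Values other than 0, 1 and 2 are not documented. */
        return (ADAPTER_FAIL);
    }
}
```

Setting the variable to 2 on a natively supported NIC is the case the text
above suggests for testing or performance comparison.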
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
and
.Nm VALE
are controlled through sysctl variables on
.Fx
.Pq Va dev.netmap.*
and module parameters on Linux
.Pq Pa /sys/module/netmap/parameters/* :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_rings: 1
Number of rings used for emulated netmap mode.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.txsync_retry: 2
Number of txsync loops in the
.Nm VALE
flush function.
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.priv_buf_num: 4098
.It Va dev.netmap.priv_buf_size: 2048
.It Va dev.netmap.priv_ring_num: 4
.It Va dev.netmap.priv_ring_size: 20480
.It Va dev.netmap.priv_if_num: 2
.It Va dev.netmap.priv_if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for private memory regions.
A separate memory region is used for each
.Nm VALE
port and each pair of
.Nm
pipes.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.It Va dev.netmap.max_bridges: 8
Maximum number of
.Nm VALE
switches that can be created.
This tunable can be specified at loader time.
.It Va dev.netmap.ptnet_vnet_hdr: 1
Allow ptnet devices to use virtio-net headers.
.It Va dev.netmap.port_numa_affinity: 0
On
.Xr numa 4
systems, allocate memory for netmap ports from the local NUMA domain when
possible.
This can improve performance by reducing the number of remote memory accesses.
However, when forwarding packets between ports attached to different NUMA
domains, this will prevent zero-copy forwarding optimizations and thus may hurt
performance.
Note that this setting must be specified as a loader tunable at boot time.
.El
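.Pp
To see why
.Va dev.netmap.buf_num
dominates the memory used by the global region, the object counts and sizes
listed above can simply be multiplied out.
The sketch below does only that; the structure
.Vt region_cfg
and the function
.Fn region_bytes
are hypothetical, and the result is a nominal figure, since the kernel
allocator may round sizes or add overhead.

```c
#include <stdint.h>

/* Hypothetical container for the dev.netmap.* object counts and sizes. */
struct region_cfg {
    uint64_t buf_num, buf_size;     /* packet buffers      */
    uint64_t ring_num, ring_size;   /* netmap_ring objects */
    uint64_t if_num, if_size;       /* netmap_if objects   */
};

/* Sum of count * size over the three object types, in bytes. */
static uint64_t
region_bytes(const struct region_cfg *c)
{
    return (c->buf_num * c->buf_size +
        c->ring_num * c->ring_size +
        c->if_num * c->if_size);
}
```

With the default values shown in the list, buffers account for
163840 * 2048 = 335544320 bytes (320 MiB) out of a nominal total of
343019520 bytes, so rings and interface objects contribute only about 7 MiB.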
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 7
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or the
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge 8
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i netmap:ix0 -i netmap:ix1
or even to connect the NIC to the host stack using netmap
.Dl bridge -i netmap:ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
\&...
/* Error checking is omitted for brevity. */
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *buf;
    void *p;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strlcpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name));
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    /* Map the shared region that holds rings and buffers. */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        /* Fill every slot currently available for transmission. */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
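.Pp
The index handling in the sender loop above relies on ring indices wrapping
around and on the ring being "empty" when
.Va cur
reaches
.Va tail .
The following self-contained model illustrates that arithmetic without
requiring a netmap device.
The structure
.Vt mock_ring
and the
.Fn mock_ring_*
functions are hypothetical stand-ins written for this example; they are not the
definitions from
.In net/netmap_user.h .

```c
#include <stdint.h>

/* Hypothetical, minimal stand-in for the index fields of a netmap ring. */
struct mock_ring {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* Index following i, wrapping to 0 at the end of the ring. */
static uint32_t
mock_ring_next(const struct mock_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots ? 0 : i + 1);
}

/* No slots available to the application: cur has caught up with tail. */
static int
mock_ring_empty(const struct mock_ring *r)
{
    return (r->cur == r->tail);
}

/* Number of slots from cur up to, but not including, tail. */
static uint32_t
mock_ring_space(const struct mock_ring *r)
{
    return (r->tail >= r->cur ? r->tail - r->cur :
        r->tail + r->num_slots - r->cur);
}

/* Advance head and cur over up to max slots, as the sender loop does. */
static uint32_t
mock_ring_fill(struct mock_ring *r, uint32_t max)
{
    uint32_t n = 0;

    while (n < max && !mock_ring_empty(r)) {
        r->head = r->cur = mock_ring_next(r, r->cur);
        n++;
    }
    return (n);
}
```

For a ring of 8 slots with
.Va cur
at 6 and
.Va tail
at 2, four slots (6, 7, 0 and 1) are available, and filling them leaves
.Va cur
equal to
.Va tail .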
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Pp
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        /* consume_pkt() is supplied by the application. */
        while ((buf = nm_nextpkt(d, &h)) != NULL)
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Pp
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    /* rxr is the receive ring, txr the transmit ring. */
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
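.Pp
The swap in the fragment above can be exercised in isolation with a mock slot
type.
This is a sketch for illustration only: the structure
.Vt mock_slot ,
the function
.Fn mock_swap_slots
and the value of
.Dv MOCK_BUF_CHANGED
are hypothetical, and real code should use
.Vt struct netmap_slot
and
.Dv NS_BUF_CHANGED
from the netmap headers.

```c
#include <stdint.h>

/* Hypothetical flag value for this sketch; real code uses NS_BUF_CHANGED. */
#define MOCK_BUF_CHANGED 0x0002

/* Minimal stand-in for the slot fields used by the swap. */
struct mock_slot {
    uint32_t buf_idx;   /* index of the buffer attached to the slot */
    uint16_t len;       /* length of the packet in the buffer       */
    uint16_t flags;
};

/*
 * Hand the received buffer to the TX slot and give the TX slot's old
 * buffer back to the RX slot, so no payload bytes are copied.
 */
static void
mock_swap_slots(struct mock_slot *rx, struct mock_slot *tx)
{
    uint32_t tmp = tx->buf_idx;

    tx->buf_idx = rx->buf_idx;
    tx->len = rx->len;
    tx->flags = MOCK_BUF_CHANGED;
    rx->buf_idx = tmp;
    rx->flags = MOCK_BUF_CHANGED;
}
```

Only the buffer indices move between the two slots; both slots are flagged
because the buffer attached to each of them has changed.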
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl valectl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Xr vale 4 ,
.Xr bridge 8 ,
.Xr valectl 8 ,
.Xr lb 8 ,
.Xr nmreplay 8 ,
.Xr pkt-gen 8
.Pp
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.