.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 23, 2018
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Pp
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for userspace and kernel clients, and for Virtual Machines.
It runs on
.Fx ,
Linux and some versions of Windows, and supports a variety of
.Nm netmap ports ,
including
.Bl -tag -width XXXX
.It Nm physical NIC ports
to access individual queues of network interfaces;
.It Nm host ports
to inject packets into the host stack;
.It Nm VALE ports
implementing a very fast and modular in-kernel software switch/dataplane;
.It Nm netmap pipes
a shared memory packet transport channel;
.It Nm netmap monitors
a mechanism similar to
.Xr bpf 4
to capture traffic.
.El
.Pp
All these
.Nm netmap ports
are accessed interchangeably with the same API,
and are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes).
With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using
.Nm
on supported NICs
reaches 14.88 million packets per second (Mpps)
with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
about 20 Mpps per core for VALE ports;
and over 100 Mpps for
.Nm netmap pipes .
NICs without native
.Nm
support can still use the API in emulated mode,
which uses unmodified device drivers and is 3-5 times faster than
.Xr bpf 4
or raw sockets.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports,
.Nm netmap pipes
and
.Nm netmap monitors
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
and synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
and
.Xr kqueue 2 .
All types of
.Nm netmap ports
and the
.Nm VALE switch
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers.
For best performance,
.Nm
requires native support in device drivers.
A list of such devices is at the end of this document.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Sh ENTERING AND EXITING NETMAP MODE
This section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in the
.Sx LIBRARIES
section.
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the netmap port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeSSS:PPP
the file descriptor is bound to port PPP of VALE switch SSS.
Switch instances and ports are dynamically created if necessary.
.Pp
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
cannot exceed IFNAMSIZ characters, and PPP cannot
be the name of any existing OS network interface.
.El
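.Pp
The naming rule for VALE ports can be checked before calling into the kernel.
The following is a minimal sketch, not part of the netmap API:
the names vale_name_ok, is_name_part and MODEL_IFNAMSIZ are hypothetical,
the length limit of 16 is an assumed stand-in for IFNAMSIZ,
and the sketch assumes the name must leave room for a terminating NUL.
It checks syntax only and cannot detect a clash with an existing OS interface.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed stand-in for IFNAMSIZ; the real value comes from <net/if.h>. */
#define MODEL_IFNAMSIZ 16

/* True if s[0..n) is non-empty and matches [0-9a-zA-Z_]+ */
static bool
is_name_part(const char *s, size_t n)
{
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') || c == '_';
        if (!ok)
            return false;
    }
    return true;
}

/* Syntax check for "valeSSS:PPP"; hypothetical helper, not a netmap call. */
static bool
vale_name_ok(const char *name)
{
    const char *sss = name + 4;
    const char *colon;

    if (strlen(name) >= MODEL_IFNAMSIZ)   /* assumption: room for the NUL */
        return false;
    if (strncmp(name, "vale", 4) != 0)
        return false;
    colon = strchr(sss, ':');
    if (colon == NULL)
        return false;
    return is_name_part(sss, (size_t)(colon - sss)) &&
        is_name_part(colon + 1, strlen(colon + 1));
}
```

With these assumptions, a name such as vale0:p1 is accepted,
while em0 or vale0:p-1 is rejected.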
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls.
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.In net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A
.Dv 0
indicates the end of the list.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Em slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
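.Pp
The list of extra buffers headed by ni_bufs_head can be walked by following
the index stored in the first uint32_t of each buffer until index 0 is found.
The following is a minimal sketch over a plain byte array:
count_extra_bufs and its parameters are hypothetical names,
the buffer base is passed in directly instead of being derived from a ring,
and the limit argument is a safety bound added for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count the buffers on an extra-buffer list (model, not a netmap call).
 * bufs:     base address of the buffer array
 * buf_size: size of each buffer in bytes
 * head:     index of the first buffer (ni_bufs_head); 0 means empty list
 * limit:    upper bound on the walk, an assumption of this sketch
 */
static uint32_t
count_extra_bufs(const unsigned char *bufs, size_t buf_size,
    uint32_t head, uint32_t limit)
{
    uint32_t n = 0;
    uint32_t idx = head;

    while (idx != 0 && n < limit) {
        uint32_t next;

        /* The first uint32_t of each buffer holds the next index. */
        memcpy(&next, bufs + (size_t)idx * buf_size, sizeof(next));
        n++;
        idx = next;
    }
    return n;
}
```

Using memcpy to read the link avoids alignment and aliasing assumptions
about the buffer contents.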
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Dv NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
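.Pp
Relative references keep the shared region valid regardless of the address
at which each process maps it.
The following is a minimal sketch of the arithmetic involved;
region_ptr and region_buf are hypothetical helpers written for illustration
and are not the macros from the netmap headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Resolve an object stored `offset` bytes from the start of a mapping. */
static void *
region_ptr(void *base, size_t offset)
{
    return (char *)base + offset;
}

/* Resolve buffer `idx` in an array of fixed-size buffers at `bufs_ofs`. */
static char *
region_buf(void *base, size_t bufs_ofs, uint32_t buf_size, uint32_t idx)
{
    return (char *)base + bufs_ofs + (size_t)idx * buf_size;
}
```

Because only offsets and indexes are stored in the region,
two processes can map it at different addresses and still agree on
which object or buffer a reference names.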
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.Pp
.Va head
is the first slot available to userspace;
.Pp
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.Pp
.Va tail
is the first slot reserved for the kernel.
.Pp
Slot indexes
.Em must
only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
which are explicitly assigned to the kernel.
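.Pp
The index rules above can be modelled without any kernel support.
The following is a minimal sketch using a local structure:
struct ring_model and the ring_next, ring_space and ring_empty helpers
are hypothetical names for this example and only mirror the behaviour
described in this section.

```c
#include <stdint.h>

/* Local model of the user-visible ring indexes (not struct netmap_ring). */
struct ring_model {
    uint32_t num_slots;
    uint32_t head;
    uint32_t cur;
    uint32_t tail;
};

/* Next index with wraparound; does not assume a power-of-two ring size. */
static uint32_t
ring_next(const struct ring_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range cur ... tail-1. */
static uint32_t
ring_space(const struct ring_model *r)
{
    return (r->tail + r->num_slots - r->cur) % r->num_slots;
}

/* True if nothing is available at the wakeup point, i.e. cur == tail. */
static int
ring_empty(const struct ring_model *r)
{
    return r->cur == r->tail;
}
```

The comparison in ring_next is used instead of a bit mask
because the ring size should not be assumed to be a power of two.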
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
.Fn select
and
.Fn poll
will block if there is no space in the ring, i.e.,
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
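.Pp
The transmit sequence described above can be sketched as follows.
This is a model only: struct tx_model, tx_next, tx_fill and tx_pending
are hypothetical names, the ring is limited to 8 slots for brevity,
and no system call is issued, so tail never advances here.

```c
#include <stdint.h>

/* Local model of a transmit ring; num_slots must not exceed 8 here. */
struct tx_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    uint16_t len[8];          /* per-slot packet length */
};

static uint32_t
tx_next(const struct tx_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Queue up to n packets of pkt_len bytes; returns how many were queued. */
static uint32_t
tx_fill(struct tx_model *r, uint32_t n, uint16_t pkt_len)
{
    uint32_t queued = 0;
    uint32_t i = r->head;

    while (queued < n && i != r->tail) {
        r->len[i] = pkt_len;      /* a real client also fills the buffer */
        i = tx_next(r, i);
        queued++;
    }
    r->head = r->cur = i;         /* hand the filled slots to the kernel */
    return queued;
}

/* Pending test as stated above: head != tail + 1 (modulo the ring size). */
static int
tx_pending(const struct tx_model *r)
{
    return tx_next(r, r->tail) != r->head;
}
```

Filling several slots before the next NIOCTXSYNC or poll() call is what
amortizes the system call cost over a batch of packets.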
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.Pp
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
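.Pp
The simplest receive loop consumes every slot in the range head ... tail-1
and returns all of them to the kernel.
The following is a minimal sketch on a local model:
struct rx_model and rx_drain are hypothetical names, the ring is limited
to 8 slots for brevity, and only the slot lengths are inspected.

```c
#include <stdint.h>

/* Local model of a receive ring; num_slots must not exceed 8 here. */
struct rx_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
    uint16_t len[8];          /* per-slot packet length */
};

/* Consume every slot in head ... tail-1 and return the bytes seen. */
static uint64_t
rx_drain(struct rx_model *r)
{
    uint64_t bytes = 0;
    uint32_t i = r->head;

    while (i != r->tail) {
        bytes += r->len[i];       /* a real client would read the buffer */
        i = (i + 1 == r->num_slots) ? 0 : i + 1;
    }
    r->head = r->cur = i;         /* return the slots to the kernel */
    return bytes;
}
```

A client that needs to hold some packets would instead stop advancing head
at the first slot it wants to keep, and move only cur further ahead.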
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
.Em must
be used when the
.Va buf_idx
in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the source MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Pp
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; no field holds the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
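.Pp
Since no field holds the total length of a multi-slot packet,
a client has to add up the fragment lengths itself.
The following is a minimal sketch on a local model:
struct slot_model, packet_len, MODEL_MOREFRAG and MODEL_MAX_FRAGS are
hypothetical names, and the flag value used here is an assumption
standing in for NS_MOREFRAG from the netmap headers.

```c
#include <stdint.h>

#define MODEL_MOREFRAG  0x0020  /* stand-in for NS_MOREFRAG; value assumed */
#define MODEL_MAX_FRAGS 64      /* chain limit stated above */

/* Local model of the slot fields used here (not struct netmap_slot). */
struct slot_model {
    uint16_t len;
    uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot `first`.
 * On success stores the index following the packet in *next.
 * Returns -1 if the chain exceeds MODEL_MAX_FRAGS or reaches `tail`
 * before a slot with the flag clear is found.
 */
static long
packet_len(const struct slot_model *slots, uint32_t num_slots,
    uint32_t first, uint32_t tail, uint32_t *next)
{
    long total = 0;
    uint32_t i = first;

    for (int n = 0; n < MODEL_MAX_FRAGS && i != tail; n++) {
        int last = !(slots[i].flags & MODEL_MOREFRAG);

        total += slots[i].len;
        i = (i + 1 == num_slots) ? 0 : i + 1;
        if (last) {
            *next = i;
            return total;
        }
    }
    return -1;   /* incomplete or over-long chain */
}
```

The value stored in *next is the slot where the following packet starts,
so the function can be called repeatedly while walking a ring.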
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
The recommended way to bind a file descriptor to a port is
to use the
.Fn nm_open
function (see
.Sx LIBRARIES ) ,
which parses port names to select specific port types and
enable features.
In the following we document the main features.
.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the memory space of its parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and to reserve extra space
in the memory region.
.Pp
On return,
.Dv NIOCREGIF
provides the same information as
.Dv NIOCGINFO ,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as
.Fn nm_open ,
described in
.Sx LIBRARIES )
can use to indicate the specific set of rings.
In the examples below, "netmap:foo" is any valid netmap port name.
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs.
.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings.
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid .
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe should be thought of as part of the pipe name;
identifiers do not need to be sequential.
On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of
.Va i .
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
into the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Fn ioctl NIOCTXSYNC
is called, or when
.Xr select 2
or
.Xr poll 2
are called with a write event (POLLOUT/wfdset) or with a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, the desired number of rings (1 by default,
and currently up to 16) can be specified using the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
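.Pp
The port-name suffixes listed under
.Dv NIOCREGIF
can be illustrated with a small classifier.
This is only a sketch: the
.Vt enum reg_mode
values and the
.Fn classify_port
helper are hypothetical names that stand in for the
.Dv NR_REG_*
constants, they are not part of the netmap API, and the real
.Fn nm_open
parser accepts more forms than shown here.
.Bd -literal -compact
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the NR_REG_* modes; the real
 * constants are defined in <net/netmap.h>. */
enum reg_mode {
    MODE_ALL_NIC,       /* "netmap:foo"   */
    MODE_SW,            /* "netmap:foo^"  */
    MODE_NIC_SW,        /* "netmap:foo+"  */
    MODE_ONE_NIC,       /* "netmap:foo-i" */
    MODE_PIPE_MASTER,   /* "netmap:foo{i" */
    MODE_PIPE_SLAVE     /* "netmap:foo}i" */
};

/*
 * Classify a port name by the first suffix character found.
 * On return *id holds the ring or pipe identifier i, or 0 when
 * the mode does not carry one.
 */
static enum reg_mode
classify_port(const char *name, unsigned *id)
{
    const char *p = name + strcspn(name, "^+-{}");
    enum reg_mode mode;

    *id = 0;
    switch (*p) {
    case '^':
        return MODE_SW;
    case '+':
        return MODE_NIC_SW;
    case '-':
        mode = MODE_ONE_NIC;
        break;
    case '{':
        mode = MODE_PIPE_MASTER;
        break;
    case '}':
        mode = MODE_PIPE_SLAVE;
        break;
    default:
        return MODE_ALL_NIC;    /* no suffix at all */
    }
    /* the identifier follows the suffix character */
    *id = (unsigned)strtoul(p + 1, NULL, 10);
    return mode;
}
.Ed
.Pp
With this sketch, the name
.Qq netmap:foo-3
maps to
.Dv MODE_ONE_NIC
with identifier 3, which is the value that would be stored in
.Va nr_ringid .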
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively, when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events.
Passing the
.Dv NETMAP_NO_TX_POLL
flag to
.Dv NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested.
Passing the
.Dv NETMAP_DO_RX_POLL
flag to
.Dv NIOCREGIF
updates receive rings even without read events.
Note that with epoll and kqueue,
.Dv NETMAP_NO_TX_POLL
and
.Dv NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
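.Pp
The blocking condition
.Va ( ring->cur == ring->tail )
mentioned earlier in this section can be modelled without a
.Nm
device.
The following is a minimal sketch, assuming a ring is fully described by
its slot count and two indexes;
.Vt struct ring_model
and the
.Fn ring_model_*
helpers are hypothetical names used only for illustration, while real
programs use
.Vt struct netmap_ring
and the helpers from
.In net/netmap_user.h .
.Bd -literal -compact
#include <stdint.h>

/* Hypothetical miniature of the ring indexes. */
struct ring_model {
    uint32_t num_slots;   /* total number of slots in the ring */
    uint32_t cur;         /* next slot the application will use */
    uint32_t tail;        /* first slot still owned by the kernel */
};

/* No slot available: the case in which poll() or select() block. */
static int
ring_model_empty(const struct ring_model *r)
{
    return r->cur == r->tail;
}

/* Slots available between cur and tail, accounting for wraparound. */
static uint32_t
ring_model_space(const struct ring_model *r)
{
    return (r->tail + r->num_slots - r->cur) % r->num_slots;
}

/* Index that follows i, wrapping to 0 at the end of the ring. */
static uint32_t
ring_model_next(const struct ring_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}
.Ed
.Pp
For an 8-slot ring with
.Va cur
at 6 and
.Va tail
at 2, the sketch reports 4 available slots (6, 7, 0 and 1).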
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.In net/netmap_user.h
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open 3pcap ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g., from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet.
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr cxgbe 4 ,
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr ixl 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr i40e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than various raw socket types
(bpf, PF_PACKET, etc.).
Note that for slow devices (such as 1 Gbit/s and slower NICs,
or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
.Pp
When emulation is in use, packet sniffer programs such as tcpdump
may see received packets before they are diverted by netmap.
This behaviour is not intentional, being just an artifact of the implementation
of emulation.
Note that if the netmap application subsequently moves packets received
from the emulated adapter onto the host RX ring, the sniffer will intercept
those packets again, since the packets are injected to the host stack as if
they were received by the network interface.
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
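.Pp
The effect of the three
.Va dev.netmap.admode
values documented in
.Sx SYSCTL VARIABLES AND MODULE PARAMETERS
can be summarised as a decision function.
This is a sketch under stated assumptions: the
.Vt enum adapter
type and the
.Fn choose_adapter
helper are hypothetical names, not kernel code, and value 0 is assumed to
prefer native mode whenever the driver offers it.
.Bd -literal -compact
/* Hypothetical outcome of putting a port in netmap mode. */
enum adapter { ADAPTER_NONE, ADAPTER_NATIVE, ADAPTER_EMULATED };

/*
 * admode:     value of dev.netmap.admode (0, 1 or 2)
 * has_native: nonzero if the driver has native netmap support
 * Returns ADAPTER_NONE when registration would fail.
 */
static enum adapter
choose_adapter(int admode, int has_native)
{
    switch (admode) {
    case 0:     /* best available option */
        return has_native ? ADAPTER_NATIVE : ADAPTER_EMULATED;
    case 1:     /* native only, fails if not available */
        return has_native ? ADAPTER_NATIVE : ADAPTER_NONE;
    case 2:     /* always emulated, never fails */
        return ADAPTER_EMULATED;
    default:    /* values outside 0..2 are not documented */
        return ADAPTER_NONE;
    }
}
.Ed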
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
.Pp
0 uses the best available option;
.Pp
1 forces native mode and fails if not available;
.Pp
2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
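.Pp
To see why
.Va dev.netmap.buf_num
dominates memory use, the pool sizes listed above can be multiplied out.
The following is a rough sketch that assumes the global region is simply
the sum of the three pools, ignoring any rounding or alignment the kernel
allocator may apply;
.Fn pool_bytes
and
.Fn region_bytes
are hypothetical helper names.
.Bd -literal -compact
#include <stdint.h>

/* Bytes taken by one pool of num objects of size bytes each. */
static uint64_t
pool_bytes(uint64_t num, uint64_t size)
{
    return num * size;
}

/* Approximate total: netmap_if objects + rings + packet buffers. */
static uint64_t
region_bytes(uint64_t if_num, uint64_t if_size,
    uint64_t ring_num, uint64_t ring_size,
    uint64_t buf_num, uint64_t buf_size)
{
    return pool_bytes(if_num, if_size) +
        pool_bytes(ring_num, ring_size) +
        pool_bytes(buf_num, buf_size);
}
.Ed
.Pp
With the default values shown in the list, the buffer pool alone accounts
for 163840 * 2048 = 335544320 bytes (320 MiB) out of an estimated total of
343019520 bytes.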
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
and rates, and to use multiple send/receive threads and cores.
.Pp
.Xr bridge 8
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even to connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator
(error checking is omitted for brevity):
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    char *buf, *p;
    uint32_t i;
    int fd;

    /* bind ix0 to the file descriptor, switching it to netmap mode */
    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strncpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    /* map the shared region and locate the first transmit ring */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);      /* wait for free transmit slots */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <poll.h>
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, NULL);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);      /* wait for incoming packets */
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    /* rxr is the receive ring, txr the transmit ring */
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
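.Pp
The buffer swap can also be tried in isolation, without a
.Nm
device.
The following is a minimal sketch that models only the slot fields involved;
.Vt struct slot_model ,
.Dv MODEL_BUF_CHANGED
and
.Fn slot_model_forward
are hypothetical stand-ins for
.Vt struct netmap_slot ,
.Dv NS_BUF_CHANGED
and the fragment above.
.Bd -literal -compact
#include <stdint.h>

/* Hypothetical stand-in for NS_BUF_CHANGED. */
#define MODEL_BUF_CHANGED 0x0001

/* Hypothetical miniature of struct netmap_slot. */
struct slot_model {
    uint32_t buf_idx;   /* index of the buffer attached to the slot */
    uint16_t len;       /* length of the packet in the buffer */
    uint16_t flags;     /* per-slot flags */
};

/*
 * Forward the packet held by src (receive side) to dst (transmit
 * side) by exchanging buffer indexes; no payload byte is copied.
 */
static void
slot_model_forward(struct slot_model *src, struct slot_model *dst)
{
    uint32_t tmp = dst->buf_idx;

    dst->buf_idx = src->buf_idx;    /* TX slot now owns the packet */
    dst->len = src->len;
    dst->flags = MODEL_BUF_CHANGED;
    src->buf_idx = tmp;             /* RX slot gets the spare buffer */
    src->flags = MODEL_BUF_CHANGED;
}
.Ed
.Pp
After the call the two slots have exchanged buffers, and both carry the
flag that tells the kernel the buffer attached to the slot has changed.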
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Xr pkt-gen 8 ,
.Xr bridge 8
.Pp
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.