.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr VALE 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Pp
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not already exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected to or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
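.Pp
The naming rules above can be illustrated with a small helper.
This is only a sketch under stated assumptions: the enum and the function
nm_classify_name() are hypothetical names that are not part of the
.Nm
API, max_len stands in for IFNAMSIZ, and the helper cannot tell whether
the part after the colon names an existing interface or a new virtual port.
.Bd -literal
```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of an nr_name string (illustration only). */
enum nm_name_kind { NM_NAME_INVALID, NM_NAME_IFACE, NM_NAME_VALE_PORT };

/*
 * Classify name.  For "valeXXX:YYY", store a pointer to the YYY part
 * in *port; otherwise *port is set to NULL.
 * max_len plays the role of IFNAMSIZ (terminator included).
 */
static enum nm_name_kind
nm_classify_name(const char *name, size_t max_len, const char **port)
{
    const char *colon;

    assert(name != NULL && port != NULL);
    *port = NULL;
    if (name[0] == 0 || strlen(name) >= max_len)
        return (NM_NAME_INVALID);
    if (strncmp(name, "vale", 4) != 0)
        return (NM_NAME_IFACE);         /* e.g. "em0" */
    colon = strchr(name, ':');
    if (colon == NULL || colon[1] == 0)
        return (NM_NAME_INVALID);       /* switch name without a port */
    *port = colon + 1;
    return (NM_NAME_VALE_PORT);         /* e.g. "vale0:p1" */
}
```
.Ed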
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated with the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is given by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx rings        */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
In receive rings, 'reserved' tells the kernel the
number of slots before 'cur' that are still in use
by the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one extra receive ring
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;   /* see details */
    const uint32_t num_slots; /* number of slots in the ring */
    uint32_t       avail;     /* number of usable slots      */
    uint32_t       cur;       /* 'current' read/write index  */
    uint32_t       reserved;  /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync()    */
#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring  */
#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots   */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots            */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal

              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail (before syscall)
                           ^
                          cur

.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index */
    uint16_t len;   /* packet length */
    uint16_t flags; /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;   /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated with the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through
macros.
.El
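.Pp
The 'cur' and 'avail' bookkeeping described for transmit and receive
rings can be sketched as follows.
The sk_ prefixed types and functions are simplified, hypothetical
stand-ins for the real structures (a fixed-size slot array replaces the
shared memory layout), and the 'reserved' field is left at zero, that is,
every consumed receive slot is released immediately.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define SK_NUM_SLOTS 8          /* illustrative ring size */

/* Simplified stand-ins for struct netmap_slot and struct netmap_ring. */
struct sk_slot { uint32_t buf_idx; uint16_t len; uint16_t flags; };
struct sk_ring {
    uint32_t num_slots;
    uint32_t avail;
    uint32_t cur;
    uint32_t reserved;
    struct sk_slot slot[SK_NUM_SLOTS];
};

/* Index following i, wrapping at the end of the ring. */
static uint32_t
sk_ring_next(const struct sk_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots ? 0 : i + 1);
}

/* TX: queue up to n packets of length len; returns how many were queued. */
static uint32_t
sk_tx_fill(struct sk_ring *r, uint32_t n, uint16_t len)
{
    uint32_t done;

    for (done = 0; done < n && r->avail > 0; done++) {
        r->slot[r->cur].len = len;  /* payload would go in the buffer */
        r->cur = sk_ring_next(r, r->cur);
        r->avail--;
    }
    return (done);
}

/* RX: consume up to n packets; returns the total number of bytes seen. */
static uint32_t
sk_rx_drain(struct sk_ring *r, uint32_t n)
{
    uint32_t done, bytes = 0;

    for (done = 0; done < n && r->avail > 0; done++) {
        bytes += r->slot[r->cur].len;
        r->cur = sk_ring_next(r, r->cur);
        r->avail--;
    }
    return (bytes);
}
```
.Ed
In both directions the kernel only learns about the new 'cur' and
'avail' values at the next netmap-related system call.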
.Pp
Some macros support access to objects in the shared memory
region.
In particular:
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively.
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
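.Pp
The way relative offsets are turned into pointers can be sketched with
a toy region.
The SK_ macros and sk_ structures below are hypothetical stand-ins, not
the definitions from the netmap headers; the sketch assumes that ring
offsets are relative to the interface structure and that buffer offsets
are relative to the ring, and the numeric offsets are arbitrary.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Simplified stand-ins; field names follow the structures above. */
struct sk_if   { uint32_t ni_rx_rings, ni_tx_rings; ssize_t ring_ofs[1]; };
struct sk_ring { ssize_t buf_ofs; uint32_t num_slots; uint16_t nr_buf_size; };

#define SK_IF_OFS 64    /* plays the role of nr_offset */

/* Interface structure: region base plus offset. */
#define SK_IF(base, ofs) \
    ((struct sk_if *)(void *)((char *)(base) + (ofs)))
/* i-th ring: offset relative to the interface structure. */
#define SK_RING(nifp, i) \
    ((struct sk_ring *)(void *)((char *)(nifp) + (nifp)->ring_ofs[i]))
/* Buffer idx: offset relative to the ring, plus idx * buffer size. */
#define SK_BUF(ring, idx) \
    ((char *)(ring) + (ring)->buf_ofs + (size_t)(idx) * (ring)->nr_buf_size)

/* Build a toy region: one interface at 64, one ring at 256, buffers at 1024. */
static void *
sk_region_new(void)
{
    char *mem = calloc(1, 4096);
    struct sk_if *nifp;
    struct sk_ring *ring;

    if (mem == NULL)
        return (NULL);
    nifp = SK_IF(mem, SK_IF_OFS);
    nifp->ni_rx_rings = nifp->ni_tx_rings = 1;
    nifp->ring_ofs[0] = 256 - SK_IF_OFS;
    ring = SK_RING(nifp, 0);
    ring->buf_ofs = 1024 - 256;
    ring->num_slots = 4;
    ring->nr_buf_size = 256;
    return (mem);
}
```
.Ed
Because every reference is an offset, the same region can be mapped at
different addresses in different processes.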
.Pp
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is opened in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It ns_ctr
on receive rings, contains the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.\" XXX maybe put it in the ptr field ?
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
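.Pp
The zero-copy forwarding use of NS_BUF_CHANGED can be sketched as a
buffer swap between a receive slot and a transmit slot.
The structure and the function sk_zero_copy_move() are hypothetical
stand-ins for illustration; the sketch assumes both rings live in the
same memory region, so that a buffer index means the same thing on both
sides.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define NS_BUF_CHANGED 0x0001   /* value as listed in struct netmap_slot */

/* Simplified stand-in for struct netmap_slot. */
struct sk_slot { uint32_t buf_idx; uint16_t len; uint16_t flags; };

/*
 * Hand the received packet to the transmit slot by exchanging buffer
 * indexes instead of copying the payload.  Both slots are flagged
 * because both buffer indexes have changed.
 */
static void
sk_zero_copy_move(struct sk_slot *rx, struct sk_slot *tx)
{
    uint32_t tmp = tx->buf_idx;

    tx->buf_idx = rx->buf_idx;
    rx->buf_idx = tmp;
    tx->len = rx->len;
    tx->flags |= NS_BUF_CHANGED;
    rx->flags |= NS_BUF_CHANGED;
}
```
.Ed
After the swap the receive slot owns the old transmit buffer, so no
buffer is lost and the receive ring can be refilled as usual.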
.Sh IOCTLS
.Nm
supports some ioctl() commands to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1 /* attach the NIC */
#define NETMAP_BDG_DETACH       2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3 /* register lookup function */
#define NETMAP_BDG_LIST         4 /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pp
.Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.Pp
.Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number of rings and their sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
into nr_ringid.
Normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
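.Pp
The nr_ringid encoding used by NIOCREGIF can be sketched with two small
helpers.
The constants are copied from the struct nmreq listing in this page;
the functions sk_ringid_hw() and sk_ringid_index() are hypothetical and
only illustrate the bit layout.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Values as listed in struct nmreq. */
#define NETMAP_HW_RING    0x4000
#define NETMAP_SW_RING    0x2000
#define NETMAP_NO_TX_POLL 0x1000
#define NETMAP_RING_MASK  0x0fff

/* nr_ringid selecting hardware ring i, optionally without txsync on poll. */
static uint16_t
sk_ringid_hw(unsigned int i, int no_tx_poll)
{
    uint16_t id = (uint16_t)(NETMAP_HW_RING | (i & NETMAP_RING_MASK));

    if (no_tx_poll)
        id |= NETMAP_NO_TX_POLL;
    return (id);
}

/* Ring number carried in the low bits of nr_ringid. */
static unsigned int
sk_ringid_index(uint16_t ringid)
{
    return (ringid & NETMAP_RING_MASK);
}
```
.Ed
A value of 0 selects all hardware rings, and NETMAP_SW_RING alone
selects the host rings, so neither needs a helper.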
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code fragment implements a traffic generator
(error checking omitted for brevity):
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>

struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
char *buf;
void *p;
uint32_t i;
int fd;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);          /* wait for free tx slots */
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
        /* ... prepare packet in buf ... */
        ring->slot[i].len = /* ... packet length ... */;
        ring->cur = NETMAP_RING_NEXT(ring, i);
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).