.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd January 4, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane.
.Pp
.Nm
and
.Nm VALE
are one order of magnitude faster than sockets, bpf or
native switches based on
.Xr tun/tap 4 ,
reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
and 20 Mpps per core for VALE ports.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
A selectable file descriptor supports
synchronization and blocking I/O.
.Pp
Similarly,
.Nm VALE
can dynamically create switch instances and ports,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
For best performance,
.Nm
requires explicit support in device drivers;
however, the
.Nm
API can be emulated on top of unmodified device drivers,
at the price of reduced performance
(but still better than sockets or BPF/pcap).
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch.
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes that own
.Pa /dev/netmap
file descriptors bound to NICs.
.Nm VALE
ports instead use separate memory regions.
.Sh ENTERING AND EXITING NETMAP MODE
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Pa sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API.
The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;          /* properties     */
    ...
    const uint32_t   ni_tx_rings;       /* NIC tx rings   */
    const uint32_t   ni_rx_rings;       /* NIC rx rings   */
    const uint32_t   ni_extra_tx_rings; /* extra tx rings */
    const uint32_t   ni_extra_rx_rings; /* extra rx rings */
    ...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_ring )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can request additional tx/rx rings,
to be used between multiple processes/threads
accessing the same
.Nm
port.
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and an array of
.Pa slots
describing the buffers.
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in
.In net/netmap_user.h
help convert them into actual pointers:
.Pp
.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
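.Pp
Resolving a relative reference amounts to adding an offset to a base
address.
The sketch below mimics that arithmetic on a simplified stand-in type.
The toy_ names and the buf_ofs field are assumptions made for
illustration only; real programs should rely on the macros from
the netmap headers rather than on this layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a ring header: only the fields used here. */
struct toy_ring {
	int64_t  buf_ofs;      /* assumed: offset from the ring to buffer 0 */
	uint32_t nr_buf_size;  /* size of each buffer */
};

/* Region base plus offset gives the object address. */
void *
toy_at_offset(void *base, size_t offset)
{
	return (char *)base + offset;
}

/* Buffer index to buffer address, relative to the ring itself. */
char *
toy_buf(struct toy_ring *ring, uint32_t idx)
{
	return (char *)ring + ring->buf_ofs +
	    (size_t)idx * ring->nr_buf_size;
}
```

Because every reference is relative, the same region can be mapped at
different addresses in different processes without any fix-up.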
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
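.Pp
A minimal sketch of this index arithmetic follows.
ring_next() and ring_distance() are illustrative names, not the netmap
functions, and they take the ring size as an explicit argument instead
of a ring pointer.
The wrap-around uses a comparison rather than a bit mask because the
ring size need not be a power of two.

```c
#include <stdint.h>

/* Next slot index, wrapping to 0 at the end of the ring. */
uint32_t
ring_next(uint32_t num_slots, uint32_t i)
{
	return (i + 1 == num_slots) ? 0 : i + 1;
}

/* Number of slots in the range from ... to-1, walking forward. */
uint32_t
ring_distance(uint32_t num_slots, uint32_t from, uint32_t to)
{
	return (to >= from) ? to - from : to + num_slots - from;
}
```

With a ring of 1000 slots, the slot after 999 is 0, and the range
998 ... 2 holds five slots.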
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
.Va tail\  . . . head-1 ,
that are explicitly assigned to the kernel.
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\  . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Pp
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() will block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
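.Pp
The transmit-side bookkeeping can be sketched on a simplified stand-in
ring.
All toy_ names and the fixed eight-slot layout are assumptions made for
illustration; the code does not touch a real netmap ring and omits the
copy of the payload into the buffer.

```c
#include <stdint.h>

/* Simplified stand-ins for the netmap structures (assumption). */
struct toy_slot { uint16_t len; };
struct toy_tx_ring {
	uint32_t num_slots;
	uint32_t head, cur, tail;
	struct toy_slot slot[8];
};

uint32_t
toy_next(const struct toy_tx_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Slots the user may fill: head ... tail-1. */
uint32_t
toy_tx_space(const struct toy_tx_ring *r)
{
	return (r->tail >= r->head) ? r->tail - r->head
	    : r->tail + r->num_slots - r->head;
}

/* Queue up to n packets of length len; returns how many were queued. */
uint32_t
toy_tx_queue(struct toy_tx_ring *r, uint32_t n, uint16_t len)
{
	uint32_t space = toy_tx_space(r), sent;

	if (n > space)
		n = space;
	for (sent = 0; sent < n; sent++) {
		r->slot[r->head].len = len;  /* payload copy omitted */
		r->head = toy_next(r, r->head);
	}
	r->cur = r->head;  /* ask to be woken as soon as a slot frees up */
	return sent;
}

/* The pending-transmission test described in the text. */
int
toy_tx_pending(const struct toy_tx_ring *r)
{
	return toy_next(r, r->tail) != r->head;
}
```

Queuing as many packets as fit before the next system call is what
amortizes the system call cost over a batch.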
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR..........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
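.Pp
The receive side can be sketched in the same spirit.
The structures and the function toy_rx_drain() are simplified stand-ins
introduced for illustration, not netmap definitions; processing a packet
is reduced here to adding up its length.

```c
#include <stdint.h>

/* Simplified stand-ins for the netmap structures (assumption). */
struct toy_slot { uint16_t len; };
struct toy_rx_ring {
	uint32_t num_slots;
	uint32_t head, cur, tail;
	struct toy_slot slot[8];
};

/*
 * Consume every received slot (head ... tail-1) and release it.
 * Returns the total number of bytes seen.
 */
uint64_t
toy_rx_drain(struct toy_rx_ring *r)
{
	uint64_t bytes = 0;
	uint32_t i = r->head;

	while (i != r->tail) {
		bytes += r->slot[i].len;  /* "process" the packet */
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
	}
	r->head = r->cur = i;  /* hand all processed slots back */
	return bytes;
}
```

A program that wants to hold on to some slots would advance head only
past the slots it has finished with, and could still move cur further
ahead.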
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
This flag helps detect
when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flag are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reduce data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
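.Pp
As an illustration of NS_BUF_CHANGED, the sketch below moves a packet
from a receive slot to a transmit slot by exchanging buffer indexes
instead of copying data.
The structure, the function name and the flag value are stand-ins
chosen for this example; real code should take the definitions from
the netmap headers.

```c
#include <stdint.h>

/* Stand-in flag value, for illustration only. */
#define TOY_NS_BUF_CHANGED 0x0001

/* Simplified stand-in for a slot (assumption). */
struct toy_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/* Hand the received buffer to the tx slot and take its buffer back. */
void
toy_zero_copy_forward(struct toy_slot *rx, struct toy_slot *tx)
{
	uint32_t tmp = tx->buf_idx;

	tx->buf_idx = rx->buf_idx;
	rx->buf_idx = tmp;
	tx->len = rx->len;
	/* Both slots now refer to a different buffer: flag the change. */
	tx->flags |= TOY_NS_BUF_CHANGED;
	rx->flags |= TOY_NS_BUF_CHANGED;
}
```

The receive slot ends up with the buffer previously owned by the
transmit slot, so no buffer is lost or shared between the two rings.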
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; the total length of a packet is not stored anywhere.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
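.Pp
Since only per-fragment lengths are available, a program that needs the
total length has to walk the chain.
The following sketch does so on a simplified stand-in slot type.
The function name, the structure and the flag value are assumptions
made for illustration and are not taken from the netmap headers.

```c
#include <stdint.h>

/* Stand-in flag value, for illustration only. */
#define TOY_NS_MOREFRAG 0x0020

/* Simplified stand-in for a slot (assumption). */
struct toy_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot first.
 * avail is the number of slots that may be examined.
 * On success stores the fragment count in *nfrags and returns the
 * total length; if the last fragment is not within avail slots,
 * stores 0 in *nfrags and returns 0.
 */
uint32_t
toy_packet_len(const struct toy_slot *slots, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nfrags)
{
	uint32_t total = 0, i = first, n;

	for (n = 1; n <= avail; n++) {
		total += slots[i].len;
		if (!(slots[i].flags & TOY_NS_MOREFRAG)) {
			*nfrags = n;
			return total;
		}
		i = (i + 1 == num_slots) ? 0 : i + 1;
	}
	*nfrags = 0;
	return 0;  /* last fragment not seen yet */
}
```

Checking the fragment count rather than the returned length
distinguishes an incomplete chain from a packet of length zero.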
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O.
They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (o) slots in tx rings          */
    uint32_t  nr_rx_slots;       /* (o) slots in rx rings          */
    uint16_t  nr_tx_rings;       /* (o) number of tx rings         */
    uint16_t  nr_rx_rings;       /* (o) number of rx rings         */
    uint16_t  nr_ringid;         /* (i) ring(s) we care about      */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i) extra arguments            */
    uint16_t  nr_arg2;           /* (i) extra arguments            */
    ...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices, see
.Xr netintro 4 .
.Pp
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region.
NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool 8 ) .
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor.
For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_ringid
selects which rings are controlled through this file descriptor.
Possible values are:
.Bl -tag -width XXXXX
.It 0
(default) all hardware rings
.It NETMAP_SW_RING
the ``host rings'', connecting to the host stack.
.It NETMAP_HW_RING | i
the i-th hardware ring.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_SYNC
to the value written to
.Va nr_ringid .
When this flag is set,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when select()/poll() is called with a write event
(POLLOUT/wfdset) or with a full ring.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT AND POLL
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS
when write (POLLOUT) and read (POLLIN) events are requested.
.Pp
Both block if no slots are available in the ring
.Pq Va ring->cur == ring->tail .
.Pp
Packets in transmit rings are normally pushed out even without
requesting write events.
Passing the NETMAP_NO_TX_SYNC flag to
.Em NIOCREGIF
disables this feature.
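.Pp
The blocking condition above can be illustrated with a small
stand-alone model of the ring indexes.
This is a sketch under stated assumptions, not the netmap implementation:
.Va struct model_ring
and the
.Va model_*
helpers are hypothetical names that only mimic the
.Va head ,
.Va cur
and
.Va tail
fields, and the indexes are assumed to wrap around at the ring size.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the indexes of one ring. */
struct model_ring {
    uint32_t num_slots;   /* ring size */
    uint32_t head;        /* first slot still owned by the program */
    uint32_t cur;         /* next slot the program will use */
    uint32_t tail;        /* first slot not available to the program */
};

/* No slots available: a poll() or select() would block here. */
static int
model_ring_empty(const struct model_ring *r)
{
    return r->cur == r->tail;
}

/* Index following i, wrapping at the end of the ring. */
static uint32_t
model_ring_next(const struct model_ring *r, uint32_t i)
{
    assert(r->num_slots > 0);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots between cur and tail. */
static uint32_t
model_ring_space(const struct model_ring *r)
{
    return (r->tail + r->num_slots - r->cur) % r->num_slots;
}

/* Use up to max slots, advancing head and cur; returns slots used. */
static uint32_t
model_ring_consume(struct model_ring *r, uint32_t max)
{
    uint32_t n = 0;

    while (n < max && !model_ring_empty(r)) {
        r->head = r->cur = model_ring_next(r, r->cur);
        n++;
    }
    return n;
}
```
.Ed
Once
.Va model_ring_consume
has used every available slot,
.Va cur
equals
.Va tail
and the program would have to wait for the kernel to advance
.Va tail .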
.Sh LIBRARIES
The
.Nm
API is meant to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For convenience, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port.
These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va  struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va flags
can be set to
.Va NETMAP_SW_RING
to bind to the host ring pair,
or to
.Va NETMAP_HW_RING
to bind to a specific ring.
With
.Va NETMAP_HW_RING ,
.Va ring_name
is interpreted as a string or an integer indicating the ring to use.
.It Va ring_flags
is copied directly into the ring flags, to specify additional parameters
such as NR_TIMESTAMP or NR_FORWARD.
.El
.It Va int nm_close(struct nm_desc_t *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error.
.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets.
.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
similar to pcap_next(), fetches the next packet.
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux:
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation.
Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Pp
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native mode and
fails if it is not available, 2 forces emulated mode and hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode.
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode.
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode.
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls.
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes.
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring.
.It Va dev.netmap.verbose: 0
Verbose kernel messages.
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region.
The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch.
Values above 64 generally guarantee good
performance.
.El
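.Pp
On Linux the module parameters appear as small text files, so a program
can inspect them with ordinary file I/O.
The following is a minimal sketch under stated assumptions:
.Va sketch_read_param
and
.Va sketch_admode_name
are hypothetical helpers, and the file name
.Pa admode
under the parameters directory is assumed to mirror the sysctl name.
.Bd -literal -compact
```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one integer parameter from a text file, for example
 * /sys/module/netmap_lin/parameters/admode (file name assumed).
 * Returns 0 and stores the value in *out, or -1 on failure.
 */
static int
sketch_read_param(const char *path, long *out)
{
    char line[64];
    char *end;
    FILE *fp;
    long v;

    assert(path != NULL && out != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof(line), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    errno = 0;
    v = strtol(line, &end, 10);
    if (end == line || errno != 0)
        return -1;
    *out = v;
    return 0;
}

/* Describe the admode values listed in this section. */
static const char *
sketch_admode_name(long mode)
{
    switch (mode) {
    case 0:
        return "best available";
    case 1:
        return "native only";
    case 2:
        return "emulated only";
    default:
        return "unknown";
    }
}
```
.Ed
A failure return simply means that the file is missing or unreadable,
for instance because the module is not loaded.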
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE
switches.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
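.Pp
Binding a thread to a core could look like the following sketch.
It assumes a FreeBSD or Linux/glibc system; the helper name
.Va sketch_pin_self
is hypothetical, and picking the lowest-numbered permitted CPU is only
an illustrative policy.
A real application would typically pick one core per ring.
.Bd -literal -compact
```c
#define _GNU_SOURCE   /* needed on Linux/glibc for the affinity calls */
#include <assert.h>
#include <pthread.h>
#if defined(__FreeBSD__)
#include <sys/param.h>
#include <sys/cpuset.h>
#include <pthread_np.h>
typedef cpuset_t sketch_cpuset_t;
#else
#include <sched.h>
typedef cpu_set_t sketch_cpuset_t;
#endif

/*
 * Pin the calling thread to the lowest-numbered CPU it is currently
 * allowed to run on. Returns that CPU number, or -1 on failure.
 */
static int
sketch_pin_self(void)
{
    sketch_cpuset_t allowed, one;
    int cpu;

    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed),
        &allowed) != 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (pthread_setaffinity_np(pthread_self(), sizeof(one),
            &one) != 0)
            return -1;
        return cpu;
    }
    return -1;
}
```
.Ed
The sketch is meant to be built with the compiler's thread support
enabled (for example the -pthread option).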
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
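.Pp
To see why small packets are demanding, the packet rate implied by
line rate on Ethernet can be worked out as in the sketch below.
The helper names are hypothetical; the packet length is assumed to
exclude the CRC, as with the 60-byte minimum size packets used
in the pkt-gen example.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes on the wire for each frame that are not part of the packet:
 * 4 (CRC) + 8 (preamble and start delimiter) + 12 (inter-frame gap).
 */
#define SKETCH_ETH_OVERHEAD 24u

/* Packets per second at full line rate; len excludes the CRC. */
static uint64_t
sketch_line_rate_pps(uint64_t link_bps, uint32_t len)
{
    assert(len > 0);
    return link_bps / ((uint64_t)(len + SKETCH_ETH_OVERHEAD) * 8);
}

/* Nanoseconds available to handle each packet at that rate. */
static double
sketch_ns_per_packet(uint64_t link_bps, uint32_t len)
{
    return 1e9 / (double)sketch_line_rate_pps(link_bps, len);
}
```
.Ed
For a 10 Gbit/s link and 60-byte packets this gives about 14.88 million
packets per second, or roughly 67 ns per packet, against about 0.81 million
packets per second for 1514-byte packets.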
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or the
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/if.h>
#include <net/netmap_user.h>
...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    /* put the interface in netmap mode */
    ioctl(fd, NIOCREGIF, &nmr);
    /* map the shared region holding rings and buffers */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        /* wait for free slots in the transmit ring */
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            /* hand the slot over for transmission */
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <poll.h>
#include <net/netmap_user.h>
...
void receiver(void)
{
    struct nm_desc_t *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_hdr_t h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        /* wait for incoming packets */
        poll(&fds, 1, -1);
        /* drain all packets currently available */
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    /* swap the buffers between the two slots */
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    /* advance both rings past the slots just processed */
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
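.Pp
The fragment above handles a single packet.
A loop that forwards every pending packet might be organized as in the
following stand-alone model, which can be compiled without the netmap
headers.
It is a sketch under stated assumptions: the
.Va fwd_*
types and helpers and the
.Va FWD_*
constants are hypothetical stand-ins for the netmap structures, and
bounding the batch by the free transmit slots is one possible policy.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

#define FWD_SLOTS       8        /* model ring size */
#define FWD_BUF_CHANGED 0x0001u  /* stand-in for NS_BUF_CHANGED */

struct fwd_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

struct fwd_ring {
    uint32_t head, cur, tail;
    struct fwd_slot slot[FWD_SLOTS];
};

/* Index following i, wrapping at the end of the ring. */
static uint32_t
fwd_next(uint32_t i)
{
    return (i + 1 == FWD_SLOTS) ? 0 : i + 1;
}

/* Number of slots between cur and tail. */
static uint32_t
fwd_space(const struct fwd_ring *r)
{
    return (r->tail + FWD_SLOTS - r->cur) % FWD_SLOTS;
}

/* Move pending packets from rxr to txr by swapping buffer indexes. */
static uint32_t
fwd_forward(struct fwd_ring *rxr, struct fwd_ring *txr)
{
    uint32_t n, todo = fwd_space(rxr);

    assert(rxr != txr);
    if (fwd_space(txr) < todo)
        todo = fwd_space(txr);   /* limited by free transmit slots */
    for (n = 0; n < todo; n++) {
        struct fwd_slot *src = &rxr->slot[rxr->cur];
        struct fwd_slot *dst = &txr->slot[txr->cur];
        uint32_t tmp = dst->buf_idx;

        dst->buf_idx = src->buf_idx;
        dst->len = src->len;
        dst->flags = FWD_BUF_CHANGED;
        src->buf_idx = tmp;
        src->flags = FWD_BUF_CHANGED;
        rxr->head = rxr->cur = fwd_next(rxr->cur);
        txr->head = txr->cur = fwd_next(txr->cur);
    }
    return todo;
}
```
.Ed
No packet data is copied: only the buffer indexes change owner, which is
why both slots are marked as changed.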
.Ss ACCESSING THE HOST STACK
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Ss SPECIAL MODES
When the device name has the form
.Dl valeXXX:ifname (ifname is an existing interface)
the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the
.Nm VALE
switch named XXX.
In this case the
.Pa ioctl()
is used only for configuration, typically through the
.Xr vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
closed after issuing the
.Pa ioctl() .