.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
.Pp
The binding can be removed (returning the interface to
regular operation, or destroying the virtual port) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
.Pa struct netmap_if
is located at offset
.Pa nr_offset
in the shared memory region; the offset is reported in the
structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
The 'reserved' field is used in receive rings to tell the kernel the
number of slots before 'cur' that are still in use by the application,
whereas 'avail' indicates how many slots starting from 'cur'
are available.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;   /* see details */
    const uint32_t num_slots; /* number of slots in the ring */
    uint32_t       avail;     /* number of usable slots      */
    uint32_t       cur;       /* 'current' read/write index  */
    uint32_t       reserved;  /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync()    */
#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring  */
#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots   */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots            */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal

              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
being processed and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail (before syscall)
                           ^
                          cur

.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index */
    uint16_t len;     /* packet length */
    uint16_t flags;   /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;     /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through
macros.
.El
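.Pp
The index arithmetic described above can be sketched in isolation.
The following is a minimal model under stated assumptions, not the
netmap implementation:
.Pa struct ring_model ,
.Pa ring_next()
and
.Pa ring_consume()
are hypothetical names introduced here, and the structure only mirrors the
.Pa num_slots ,
.Pa avail
and
.Pa cur
fields of
.Pa struct netmap_ring .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for three fields of struct netmap_ring. */
struct ring_model {
    uint32_t num_slots; /* number of slots in the ring */
    uint32_t avail;     /* number of usable slots */
    uint32_t cur;       /* 'current' read/write index */
};

/* Index of the slot following i, wrapping at the end of the ring. */
static uint32_t
ring_next(const struct ring_model *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/*
 * Use up to 'want' slots starting at 'cur', as an application would
 * after filling (tx) or processing (rx) the corresponding buffers.
 * Returns the number of slots actually used.
 */
static uint32_t
ring_consume(struct ring_model *r, uint32_t want)
{
    uint32_t n = 0;

    while (n < want && r->avail > 0) {
        r->cur = ring_next(r, r->cur);
        r->avail--;
        n++;
    }
    return n;
}
```
.Ed
In this model, using 3 slots of an 8-slot ring that starts with
5 available slots and an index of 6 wraps the index around to 1
and leaves 2 slots available.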
.Pp
Some macros support the access to objects in the shared memory
region.
In particular,
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
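.Pp
Since all references in the shared region are offsets, turning one
into a pointer is plain address arithmetic.
The sketch below illustrates the idea on a simplified, hypothetical layout
.Pa ( struct if_model ,
.Pa struct ringhdr_model ,
.Pa model_if()
and
.Pa model_ring()
are not part of the netmap API), and assumes that ring offsets are
relative to the start of the interface structure.
The real macros should be used in actual clients.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified interface descriptor. */
struct if_model {
    unsigned int n_rings;
    ssize_t      ring_ofs[4]; /* assumed relative to this struct */
};

/* Hypothetical, simplified ring header. */
struct ringhdr_model {
    unsigned int num_slots;
};

/* Descriptor located 'offset' bytes into the mapped region. */
static struct if_model *
model_if(void *base, size_t offset)
{
    return (struct if_model *)((char *)base + offset);
}

/* i-th ring, located ring_ofs[i] bytes after the descriptor. */
static struct ringhdr_model *
model_ring(struct if_model *nifp, unsigned int i)
{
    assert(i < nifp->n_rings);
    return (struct ringhdr_model *)((char *)nifp + nifp->ring_ofs[i]);
}
```
.Ed
Keeping only offsets in shared memory lets each process map the region
at a different address without invalidating the stored references.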
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
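.Pp
The bit layout of the slot flags can be exercised with a small,
self-contained sketch.
The constants and the
.Pa NS_RFRAGS
macro are copied from the
.Pa struct netmap_slot
listing above;
.Pa struct slot_model ,
.Pa slot_set_upper()
and
.Pa chain_length()
are hypothetical helpers introduced only for illustration.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the struct netmap_slot listing. */
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)

/* Hypothetical stand-in holding only the flags field of a slot. */
struct slot_model {
    uint16_t flags;
};

/* Store v in the upper byte of flags, keeping the low flag bits. */
static void
slot_set_upper(struct slot_model *s, uint8_t v)
{
    s->flags = (uint16_t)((s->flags & ~NS_PORT_MASK) |
        ((uint16_t)v << NS_PORT_SHIFT));
}

/*
 * Count the buffers of the packet starting at slot[first] by
 * following NS_MOREFRAG until a slot with the flag clear is found.
 * The walk wraps at the end of the ring and is bounded by nslots.
 */
static unsigned int
chain_length(const struct slot_model *slot, unsigned int nslots,
    unsigned int first)
{
    unsigned int i = first, n = 1;

    assert(first < nslots);
    while ((slot[i].flags & NS_MOREFRAG) && n < nslots) {
        i = (i + 1 == nslots) ? 0 : i + 1;
        n++;
    }
    return n;
}
```
.Ed
Because the upper byte shares the 16-bit flags word with the low
flag bits, updates to one half must mask out the other, as
.Pa slot_set_upper()
does.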
.Sh IOCTLS
.Nm
supports some ioctls to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1       /* attach the NIC */
#define NETMAP_BDG_DETACH       2       /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3       /* register lookup function */
#define NETMAP_BDG_LIST         4       /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pa nr_tx_slots, nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings, nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
to nr_ringid.
Normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, the desired number of rings (1 by default,
and currently up to 16) can be specified using the
nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
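.Pp
The encoding of
.Pa nr_ringid
described under
.Pa NIOCREGIF
can be sketched as follows.
The constants are copied from the
.Pa struct nmreq
listing above; the helper functions
.Pa ringid_hw() ,
.Pa ringid_is_hw()
and
.Pa ringid_index()
are hypothetical and only illustrate the bit layout.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the struct nmreq listing. */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */

/* nr_ringid value selecting the i-th hardware ring. */
static uint16_t
ringid_hw(unsigned int i)
{
    assert(i <= NETMAP_RING_MASK);
    return (uint16_t)(NETMAP_HW_RING | i);
}

/* Nonzero if the value selects a single hardware ring. */
static int
ringid_is_hw(uint16_t id)
{
    return (id & NETMAP_HW_RING) != 0;
}

/* Ring number carried in the low bits of the value. */
static unsigned int
ringid_index(uint16_t id)
{
    return id & NETMAP_RING_MASK;
}
```
.Ed
A client that wants hardware ring 3 without the implicit transmit
sync on poll would, under these assumptions, set
.Pa nr_ringid
to
.Pa ringid_hw(3) | NETMAP_NO_TX_POLL
before issuing
.Pa NIOCREGIF .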
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code fragment outlines a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
char *buf;
void *p;
uint32_t i;
int fd;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);     /* bind fd to ix0 */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);  /* first transmit ring */
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);          /* wait for free tx slots */
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
        ... prepare packet in buf ...
        ring->slot[i].len = ... packet length ...
        ring->cur = NETMAP_RING_NEXT(ring, i);
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).