.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Pp
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected to or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
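.Pp
The naming rules above can be sketched as a small classifier.
This is only an illustration: nm_classify_name and the nm_name_kind
values are hypothetical names that are not part of the netmap API,
and treating a name that starts with "vale" but has no port part as
invalid is an assumption.
The caller passes IFNAMSIZ as the size limit.
.Bd -literal
```c
#include <string.h>

/* Hypothetical result codes; not part of netmap. */
enum nm_name_kind { NM_NAME_IFACE, NM_NAME_VALE, NM_NAME_INVALID };

/*
 * Classify an nr_name candidate following the rules listed above.
 * ifnamsiz is the size of the nr_name field (IFNAMSIZ).
 * On NM_NAME_VALE, *port points at the YYY part inside name.
 */
static enum nm_name_kind
nm_classify_name(const char *name, size_t ifnamsiz, const char **port)
{
    const char *colon;

    *port = NULL;
    if (name[0] == 0 || strlen(name) >= ifnamsiz)
        return NM_NAME_INVALID;     /* must fit in nr_name */
    if (strncmp(name, "vale", 4) != 0)
        return NM_NAME_IFACE;       /* e.g. "em0" */
    colon = strchr(name, ':');
    if (colon == NULL || colon[1] == 0)
        return NM_NAME_INVALID;     /* assumed: port part is required */
    *port = colon + 1;
    return NM_NAME_VALE;
}
```
.Ed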
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
In receive rings, 'reserved' tells the kernel the
number of slots before 'cur' that are still in use
by the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;   /* see details */
    const uint32_t num_slots; /* number of slots in the ring */
    uint32_t       avail;     /* number of usable slots      */
    uint32_t       cur;       /* 'current' read/write index  */
    uint32_t       reserved;  /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync()    */
#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring  */
#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots   */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots            */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal
              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating
'cur' and 'avail' accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail (before syscall)
                           ^
                          cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index */
    uint16_t len;     /* packet length */
    uint16_t flags;   /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;     /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through
macros.
.El
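.Pp
The 'cur', 'avail' and 'reserved' bookkeeping shown in the two
figures can be mimicked with a toy ring.
The sketch below is not netmap code: mock_ring and its helpers are
hypothetical names that copy only the index fields of
.Pa struct netmap_ring ,
and the receive helper assumes that no slots were reserved before the call.
.Bd -literal
```c
#include <stdint.h>

/* Simplified stand-in for struct netmap_ring: index fields only. */
struct mock_ring {
    uint32_t num_slots;
    uint32_t avail;
    uint32_t cur;
    uint32_t reserved;
};

/* Next slot index, wrapping around at the end of the ring. */
static uint32_t
mock_ring_next(const struct mock_ring *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* TX: queue up to n packets; returns how many slots were filled. */
static uint32_t
mock_tx_fill(struct mock_ring *r, uint32_t n)
{
    uint32_t sent = 0;

    while (sent < n && r->avail > 0) {
        /* ... fill slot[r->cur] and its buffer here ... */
        r->cur = mock_ring_next(r, r->cur);
        r->avail--;
        sent++;
    }
    return sent;
}

/*
 * RX: consume up to n packets, keeping the last 'hold' of them
 * reserved (still under processing, not released to the kernel).
 */
static uint32_t
mock_rx_consume(struct mock_ring *r, uint32_t n, uint32_t hold)
{
    uint32_t got = 0;

    while (got < n && r->avail > 0) {
        /* ... read slot[r->cur] and its buffer here ... */
        r->cur = mock_ring_next(r, r->cur);
        r->avail--;
        got++;
    }
    r->reserved = (hold < got) ? hold : got;
    return got;
}
```
.Ed
.Pp
In a real client the same updates are applied to the ring in shared
memory, and take effect at the next netmap-related system call.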
.Pp
Some macros support the access to objects in the shared memory
region.
In particular,
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
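.Pp
The way such macros turn relative references into pointers can be
sketched as follows.
This is a simplified model, not the definitions from
.In net/netmap_user.h :
the mock_ names are hypothetical, and the sketch assumes that ring
offsets are relative to the interface structure and that buffer
offsets are relative to the ring.
.Bd -literal
```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the shared structures (assumed layout). */
struct mock_if {
    uint32_t  rx_rings;
    uint32_t  tx_rings;
    ptrdiff_t ring_ofs[4];  /* byte offsets from the start of mock_if */
};

struct mock_ring {
    ptrdiff_t buf_ofs;      /* byte offset from the ring to buffer 0 */
    uint32_t  buf_size;     /* size of each buffer */
};

/* Ring i of an interface: base address plus relative offset. */
static struct mock_ring *
mock_txring(struct mock_if *nifp, uint32_t i)
{
    return (struct mock_ring *)((char *)nifp + nifp->ring_ofs[i]);
}

/* Buffer idx: ring address plus offset, scaled by the buffer size. */
static char *
mock_buf(struct mock_ring *ring, uint32_t idx)
{
    return (char *)ring + ring->buf_ofs + (size_t)idx * ring->buf_size;
}
```
.Ed
.Pp
Because only offsets and indexes are stored, the same region can be
mapped at different addresses in different processes.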
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
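.Pp
Since a slot carries no total-length field, a client that receives a
multi-buffer packet has to add up the fragment lengths itself.
The sketch below shows one way to do it.
The flag definitions are copied from the
.Pa struct netmap_slot
listing above, whereas mock_slot and mock_packet_len are hypothetical
names used only for this illustration.
.Bd -literal
```c
#include <stdint.h>

/* Definitions copied from the struct netmap_slot listing above. */
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)

/* Simplified slot: only the fields used here. */
struct mock_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/*
 * Sum the lengths of the fragments of the packet starting at
 * slot[first].  Stops at the first slot with NS_MOREFRAG clear
 * (or after a full lap of the ring); *nfrags gets the slot count.
 */
static uint32_t
mock_packet_len(const struct mock_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t *nfrags)
{
    uint32_t i = first, total = 0, n = 0;

    for (;;) {
        total += slot[i].len;
        n++;
        if (!(slot[i].flags & NS_MOREFRAG) || n == num_slots)
            break;
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nfrags = n;
    return total;
}
```
.Ed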
.Sh IOCTLS
.Nm
supports some ioctl() calls to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1       /* attach the NIC */
#define NETMAP_BDG_DETACH       2       /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3       /* register lookup function */
#define NETMAP_BDG_LIST         4       /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pp
.Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.Pp
.Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
.Pp
By default, a
.Pa poll()
or
.Pa select()
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
into nr_ringid.
Normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
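.Pp
The nr_ringid values accepted by
.Pa NIOCREGIF
can be composed and decoded as in the following sketch.
The constants are copied from the
.Pa struct nmreq
listing above; mock_ringid_hw and mock_ringid_kind are hypothetical
helpers, and checking the host-ring bit before the hardware-ring bit
is an assumption about how the kernel interprets the field.
.Bd -literal
```c
#include <stdint.h>

/* Values copied from the struct nmreq listing above. */
#define NETMAP_HW_RING    0x4000
#define NETMAP_SW_RING    0x2000
#define NETMAP_NO_TX_POLL 0x1000
#define NETMAP_RING_MASK  0xfff

/* nr_ringid selecting hardware ring i, optionally without txsync on poll. */
static uint16_t
mock_ringid_hw(uint16_t i, int no_tx_poll)
{
    uint16_t id = NETMAP_HW_RING | (i & NETMAP_RING_MASK);

    if (no_tx_poll)
        id |= NETMAP_NO_TX_POLL;
    return id;
}

/*
 * Decode a ringid: returns 0 for all hardware rings, 1 for the
 * host rings, 2 for a single hardware ring (stored in *ring).
 */
static int
mock_ringid_kind(uint16_t id, uint16_t *ring)
{
    if (id & NETMAP_SW_RING)
        return 1;
    if (id & NETMAP_HW_RING) {
        *ring = id & NETMAP_RING_MASK;
        return 2;
    }
    return 0;
}
```
.Ed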
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
void *p;
char *buf;
uint32_t i;
int fd;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);     /* bind fd to ix0 */
/* map the shared region holding rings and buffers */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);          /* wait for free tx slots */
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
        ... prepare packet in buf ...
        ring->slot[i].len = ... packet length ...
        ring->cur = NETMAP_RING_NEXT(ring, i);
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
557CHANGE (257422) and OPENLAB (287581).
558