xref: /freebsd/share/man/man4/netmap.4 (revision 5ab1c5846ff41be24b1f6beb0317bf8258cd4409)
1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
30.Dd November 8, 2019
31.Dt NETMAP 4
32.Os
33.Sh NAME
34.Nm netmap
35.Nd a framework for fast packet I/O
36.Sh SYNOPSIS
37.Cd device netmap
38.Sh DESCRIPTION
39.Nm
40is a framework for extremely fast and efficient packet I/O
41for userspace and kernel clients, and for Virtual Machines.
42It runs on
43.Fx ,
44Linux and some versions of Windows, and supports a variety of
45.Nm netmap ports ,
46including
47.Bl -tag -width XXXX
48.It Nm physical NIC ports
49to access individual queues of network interfaces;
50.It Nm host ports
51to inject packets into the host stack;
52.It Nm VALE ports
53implementing a very fast and modular in-kernel software switch/dataplane;
54.It Nm netmap pipes
55a shared memory packet transport channel;
56.It Nm netmap monitors
57a mechanism similar to
58.Xr bpf 4
59to capture traffic
60.El
61.Pp
62All these
63.Nm netmap ports
64are accessed interchangeably with the same API,
65and are at least one order of magnitude faster than
66standard OS mechanisms
67(sockets, bpf, tun/tap interfaces, native switches, pipes).
68With suitably fast hardware (NICs, PCIe buses, CPUs),
69packet I/O using
70.Nm
71on supported NICs
72reaches 14.88 million packets per second (Mpps)
73with much less than one core on 10 Gbit/s NICs;
7435-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
75about 20 Mpps per core for VALE ports;
76and over 100 Mpps for
77.Nm netmap pipes .
78NICs without native
79.Nm
80support can still use the API in emulated mode,
81which uses unmodified device drivers and is 3-5 times faster than
82.Xr bpf 4
83or raw sockets.
84.Pp
85Userspace clients can dynamically switch NICs into
86.Nm
87mode and send and receive raw packets through
88memory mapped buffers.
89Similarly,
90.Nm VALE
91switch instances and ports,
92.Nm netmap pipes
93and
94.Nm netmap monitors
95can be created dynamically,
96providing high speed packet I/O between processes,
97virtual machines, NICs and the host stack.
98.Pp
99.Nm
100supports both non-blocking I/O through
101.Xr ioctl 2 ,
102synchronization and blocking I/O through a file descriptor
103and standard OS mechanisms such as
104.Xr select 2 ,
105.Xr poll 2 ,
106.Xr kqueue 2
107and
108.Xr epoll 7 .
109All types of
110.Nm netmap ports
111and the
112.Nm VALE switch
113are implemented by a single kernel module, which also emulates the
114.Nm
115API over standard drivers.
116For best performance,
117.Nm
118requires native support in device drivers.
119A list of such devices is at the end of this document.
120.Pp
121In the rest of this (long) manual page we document
122various aspects of the
123.Nm
124and
125.Nm VALE
126architecture, features and usage.
127.Sh ARCHITECTURE
128.Nm
129supports raw packet I/O through a
130.Em port ,
131which can be connected to a physical interface
132.Em ( NIC ) ,
133to the host stack,
134or to a
135.Nm VALE
136switch.
137Ports use preallocated circular queues of buffers
138.Em ( rings )
139residing in an mmapped region.
140There is one ring for each transmit/receive queue of a
141NIC or virtual port.
142An additional ring pair connects to the host stack.
143.Pp
144After binding a file descriptor to a port, a
145.Nm
146client can send or receive packets in batches through
147the rings, and possibly implement zero-copy forwarding
148between ports.
149.Pp
150All NICs operating in
151.Nm
152mode use the same memory region,
153accessible to all processes who own
154.Pa /dev/netmap
155file descriptors bound to NICs.
156Independent
157.Nm VALE
158and
159.Nm netmap pipe
160ports
161by default use separate memory regions,
162but can be independently configured to share memory.
163.Sh ENTERING AND EXITING NETMAP MODE
164The following section describes the system calls to create
165and control
166.Nm netmap
167ports (including
168.Nm VALE
169and
170.Nm netmap pipe
171ports).
172Simpler, higher level functions are described in the
173.Sx LIBRARIES
174section.
175.Pp
176Ports and rings are created and controlled through a file descriptor,
177created by opening a special device
178.Dl fd = open("/dev/netmap");
179and then bound to a specific port with an
180.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
181.Pp
182.Nm
183has multiple modes of operation controlled by the
184.Vt struct nmreq
185argument.
186.Va arg.nr_name
187specifies the netmap port name, as follows:
188.Bl -tag -width XXXX
189.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
190the data path of the NIC is disconnected from the host stack,
191and the file descriptor is bound to the NIC (one or all queues),
192or to the host stack;
193.It Dv valeSSS:PPP
194the file descriptor is bound to port PPP of VALE switch SSS.
195Switch instances and ports are dynamically created if necessary.
196.Pp
197Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
198cannot exceed IFNAMSIZ characters, and PPP cannot
199be the name of any existing OS network interface.
200.El
201.Pp
202On return,
203.Va arg
204indicates the size of the shared memory region,
205and the number, size and location of all the
206.Nm
207data structures, which can be accessed by mmapping the memory
208.Dl char *mem = mmap(0, arg.nr_memsize, fd);
209.Pp
210Non-blocking I/O is done with special
211.Xr ioctl 2
212.Xr select 2
213and
214.Xr poll 2
215on the file descriptor permit blocking I/O.
216.Pp
217While a NIC is in
218.Nm
219mode, the OS will still believe the interface is up and running.
220OS-generated packets for that NIC end up into a
221.Nm
222ring, and another ring is used to send packets into the OS network stack.
223A
224.Xr close 2
225on the file descriptor removes the binding,
226and returns the NIC to normal mode (reconnecting the data path
227to the host stack), or destroys the virtual port.
228.Sh DATA STRUCTURES
229The data structures in the mmapped memory region are detailed in
230.In sys/net/netmap.h ,
231which is the ultimate reference for the
232.Nm
233API.
234The main structures and fields are indicated below:
235.Bl -tag -width XXX
236.It Dv struct netmap_if (one per interface )
237.Bd -literal
238struct netmap_if {
239    ...
240    const uint32_t   ni_flags;      /* properties              */
241    ...
242    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
243    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
244    uint32_t         ni_bufs_head;  /* head of extra bufs list */
245    ...
246};
247.Ed
248.Pp
249Indicates the number of available rings
250.Pa ( struct netmap_rings )
251and their position in the mmapped region.
252The number of tx and rx rings
253.Pa ( ni_tx_rings , ni_rx_rings )
254normally depends on the hardware.
255NICs also have an extra tx/rx ring pair connected to the host stack.
256.Em NIOCREGIF
257can also request additional unbound buffers in the same memory space,
258to be used as temporary storage for packets.
259The number of extra
260buffers is specified in the
261.Va arg.nr_arg3
262field.
263On success, the kernel writes back to
264.Va arg.nr_arg3
265the number of extra buffers actually allocated (they may be less
266than the amount requested if the memory space ran out of buffers).
267.Pa ni_bufs_head
268contains the index of the first of these extra buffers,
269which are connected in a list (the first uint32_t of each
270buffer being the index of the next buffer in the list).
271A
272.Dv 0
273indicates the end of the list.
274The application is free to modify
275this list and use the buffers (i.e., binding them to the slots of a
276netmap ring).
277When closing the netmap file descriptor,
278the kernel frees the buffers contained in the list pointed by
279.Pa ni_bufs_head
280, irrespectively of the buffers originally provided by the kernel on
281.Em NIOCREGIF .
282.It Dv struct netmap_ring (one per ring )
283.Bd -literal
284struct netmap_ring {
285    ...
286    const uint32_t num_slots;   /* slots in each ring            */
287    const uint32_t nr_buf_size; /* size of each buffer           */
288    ...
289    uint32_t       head;        /* (u) first buf owned by user   */
290    uint32_t       cur;         /* (u) wakeup position           */
291    const uint32_t tail;        /* (k) first buf owned by kernel */
292    ...
293    uint32_t       flags;
294    struct timeval ts;          /* (k) time of last rxsync()     */
295    ...
296    struct netmap_slot slot[0]; /* array of slots                */
297}
298.Ed
299.Pp
300Implements transmit and receive rings, with read/write
301pointers, metadata and an array of
302.Em slots
303describing the buffers.
304.It Dv struct netmap_slot (one per buffer )
305.Bd -literal
306struct netmap_slot {
307    uint32_t buf_idx;           /* buffer index                 */
308    uint16_t len;               /* packet length                */
309    uint16_t flags;             /* buf changed, etc.            */
310    uint64_t ptr;               /* address for indirect buffers */
311};
312.Ed
313.Pp
314Describes a packet buffer, which normally is identified by
315an index and resides in the mmapped region.
316.It Dv packet buffers
317Fixed size (normally 2 KB) packet buffers allocated by the kernel.
318.El
319.Pp
320The offset of the
321.Pa struct netmap_if
322in the mmapped region is indicated by the
323.Pa nr_offset
324field in the structure returned by
325.Dv NIOCREGIF .
326From there, all other objects are reachable through
327relative references (offsets or indexes).
328Macros and functions in
329.In net/netmap_user.h
330help converting them into actual pointers:
331.Pp
332.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
333.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
334.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
335.Pp
336.Dl char *buf = NETMAP_BUF(ring, buffer_index);
337.Sh RINGS, BUFFERS AND DATA I/O
338.Va Rings
339are circular queues of packets with three indexes/pointers
340.Va ( head , cur , tail ) ;
341one slot is always kept empty.
342The ring size
343.Va ( num_slots )
344should not be assumed to be a power of two.
345.Pp
346.Va head
347is the first slot available to userspace;
348.Pp
349.Va cur
350is the wakeup point:
351select/poll will unblock when
352.Va tail
353passes
354.Va cur ;
355.Pp
356.Va tail
357is the first slot reserved to the kernel.
358.Pp
359Slot indexes
360.Em must
361only move forward;
362for convenience, the function
363.Dl nm_ring_next(ring, index)
364returns the next index modulo the ring size.
365.Pp
366.Va head
367and
368.Va cur
369are only modified by the user program;
370.Va tail
371is only modified by the kernel.
372The kernel only reads/writes the
373.Vt struct netmap_ring
374slots and buffers
375during the execution of a netmap-related system call.
376The only exception are slots (and buffers) in the range
377.Va tail\  . . . head-1 ,
378that are explicitly assigned to the kernel.
379.Ss TRANSMIT RINGS
380On transmit rings, after a
381.Nm
382system call, slots in the range
383.Va head\  . . . tail-1
384are available for transmission.
385User code should fill the slots sequentially
386and advance
387.Va head
388and
389.Va cur
390past slots ready to transmit.
391.Va cur
392may be moved further ahead if the user code needs
393more slots before further transmissions (see
394.Sx SCATTER GATHER I/O ) .
395.Pp
396At the next NIOCTXSYNC/select()/poll(),
397slots up to
398.Va head-1
399are pushed to the port, and
400.Va tail
401may advance if further slots have become available.
402Below is an example of the evolution of a TX ring:
403.Bd -literal
404    after the syscall, slots between cur and tail are (a)vailable
405              head=cur   tail
406               |          |
407               v          v
408     TX  [.....aaaaaaaaaaa.............]
409
410    user creates new packets to (T)ransmit
411                head=cur tail
412                    |     |
413                    v     v
414     TX  [.....TTTTTaaaaaa.............]
415
416    NIOCTXSYNC/poll()/select() sends packets and reports new slots
417                head=cur      tail
418                    |          |
419                    v          v
420     TX  [..........aaaaaaaaaaa........]
421.Ed
422.Pp
423.Fn select
424and
425.Fn poll
426will block if there is no space in the ring, i.e.,
427.Dl ring->cur == ring->tail
428and return when new slots have become available.
429.Pp
430High speed applications may want to amortize the cost of system calls
431by preparing as many packets as possible before issuing them.
432.Pp
433A transmit ring with pending transmissions has
434.Dl ring->head != ring->tail + 1 (modulo the ring size).
435The function
436.Va int nm_tx_pending(ring)
437implements this test.
438.Ss RECEIVE RINGS
439On receive rings, after a
440.Nm
441system call, the slots in the range
442.Va head\& . . . tail-1
443contain received packets.
444User code should process them and advance
445.Va head
446and
447.Va cur
448past slots it wants to return to the kernel.
449.Va cur
450may be moved further ahead if the user code wants to
451wait for more packets
452without returning all the previous slots to the kernel.
453.Pp
454At the next NIOCRXSYNC/select()/poll(),
455slots up to
456.Va head-1
457are returned to the kernel for further receives, and
458.Va tail
459may advance to report new incoming packets.
460.Pp
461Below is an example of the evolution of an RX ring:
462.Bd -literal
463    after the syscall, there are some (h)eld and some (R)eceived slots
464           head  cur     tail
465            |     |       |
466            v     v       v
467     RX  [..hhhhhhRRRRRRRR..........]
468
469    user advances head and cur, releasing some slots and holding others
470               head cur  tail
471                 |  |     |
472                 v  v     v
473     RX  [..*****hhhRRRRRR...........]
474
475    NICRXSYNC/poll()/select() recovers slots and reports new packets
476               head cur        tail
477                 |  |           |
478                 v  v           v
479     RX  [.......hhhRRRRRRRRRRRR....]
480.Ed
481.Sh SLOTS AND PACKET BUFFERS
482Normally, packets should be stored in the netmap-allocated buffers
483assigned to slots when ports are bound to a file descriptor.
484One packet is fully contained in a single buffer.
485.Pp
486The following flags affect slot and buffer processing:
487.Bl -tag -width XXX
488.It NS_BUF_CHANGED
489.Em must
490be used when the
491.Va buf_idx
492in the slot is changed.
493This can be used to implement
494zero-copy forwarding, see
495.Sx ZERO-COPY FORWARDING .
496.It NS_REPORT
497reports when this buffer has been transmitted.
498Normally,
499.Nm
500notifies transmit completions in batches, hence signals
501can be delayed indefinitely.
502This flag helps detect
503when packets have been sent and a file descriptor can be closed.
504.It NS_FORWARD
505When a ring is in 'transparent' mode,
506packets marked with this flag by the user application are forwarded to the
507other endpoint at the next system call, thus restoring (in a selective way)
508the connection between a NIC and the host stack.
509.It NS_NO_LEARN
510tells the forwarding code that the source MAC address for this
511packet must not be used in the learning bridge code.
512.It NS_INDIRECT
513indicates that the packet's payload is in a user-supplied buffer
514whose user virtual address is in the 'ptr' field of the slot.
515The size can reach 65535 bytes.
516.Pp
517This is only supported on the transmit ring of
518.Nm VALE
519ports, and it helps reducing data copies in the interconnection
520of virtual machines.
521.It NS_MOREFRAG
522indicates that the packet continues with subsequent buffers;
523the last buffer in a packet must have the flag clear.
524.El
525.Sh SCATTER GATHER I/O
526Packets can span multiple slots if the
527.Va NS_MOREFRAG
528flag is set in all but the last slot.
529The maximum length of a chain is 64 buffers.
530This is normally used with
531.Nm VALE
532ports when connecting virtual machines, as they generate large
533TSO segments that are not split unless they reach a physical device.
534.Pp
535NOTE: The length field always refers to the individual
536fragment; there is no place with the total length of a packet.
537.Pp
538On receive rings the macro
539.Va NS_RFRAGS(slot)
540indicates the remaining number of slots for this packet,
541including the current one.
542Slots with a value greater than 1 also have NS_MOREFRAG set.
543.Sh IOCTLS
544.Nm
545uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
546for non-blocking I/O.
547They take no argument.
548Two more ioctls (NIOCGINFO, NIOCREGIF) are used
549to query and configure ports, with the following argument:
550.Bd -literal
551struct nmreq {
552    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
553    uint32_t  nr_version;        /* (i) API version                */
554    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
555    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
556    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
557    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
558    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
559    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
560    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
561    uint16_t  nr_cmd;            /* (i) special command            */
562    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
563    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
564    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
565    uint32_t  nr_flags           /* (i/o) open mode                */
566    ...
567};
568.Ed
569.Pp
570A file descriptor obtained through
571.Pa /dev/netmap
572also supports the ioctl supported by network devices, see
573.Xr netintro 4 .
574.Bl -tag -width XXXX
575.It Dv NIOCGINFO
576returns EINVAL if the named port does not support netmap.
577Otherwise, it returns 0 and (advisory) information
578about the port.
579Note that all the information below can change before the
580interface is actually put in netmap mode.
581.Bl -tag -width XX
582.It Pa nr_memsize
583indicates the size of the
584.Nm
585memory region.
586NICs in
587.Nm
588mode all share the same memory region,
589whereas
590.Nm VALE
591ports have independent regions for each port.
592.It Pa nr_tx_slots , nr_rx_slots
593indicate the size of transmit and receive rings.
594.It Pa nr_tx_rings , nr_rx_rings
595indicate the number of transmit
596and receive rings.
597Both ring number and sizes may be configured at runtime
598using interface-specific functions (e.g.,
599.Xr ethtool 8
600).
601.El
602.It Dv NIOCREGIF
603binds the port named in
604.Va nr_name
605to the file descriptor.
606For a physical device this also switches it into
607.Nm
608mode, disconnecting
609it from the host stack.
610Multiple file descriptors can be bound to the same port,
611with proper synchronization left to the user.
612.Pp
613The recommended way to bind a file descriptor to a port is
614to use function
615.Va nm_open(..)
616(see
617.Sx LIBRARIES )
618which parses names to access specific port types and
619enable features.
620In the following we document the main features.
621.Pp
622.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
623.Em netmap pipe ,
624consisting of two netmap ports with a crossover connection.
625A netmap pipe share the same memory space of the parent port,
626and is meant to enable configuration where a master process acts
627as a dispatcher towards slave processes.
628.Pp
629To enable this function, the
630.Pa nr_arg1
631field of the structure can be used as a hint to the kernel to
632indicate how many pipes we expect to use, and reserve extra space
633in the memory region.
634.Pp
635On return, it gives the same info as NIOCGINFO,
636with
637.Pa nr_ringid
638and
639.Pa nr_flags
640indicating the identity of the rings controlled through the file
641descriptor.
642.Pp
643.Va nr_flags
644.Va nr_ringid
645selects which rings are controlled through this file descriptor.
646Possible values of
647.Pa nr_flags
648are indicated below, together with the naming schemes
649that application libraries (such as the
650.Nm nm_open
651indicated below) can use to indicate the specific set of rings.
652In the example below, "netmap:foo" is any valid netmap port name.
653.Bl -tag -width XXXXX
654.It NR_REG_ALL_NIC                         "netmap:foo"
655(default) all hardware ring pairs
656.It NR_REG_SW            "netmap:foo^"
657the ``host rings'', connecting to the host stack.
658.It NR_REG_NIC_SW        "netmap:foo+"
659all hardware rings and the host rings
660.It NR_REG_ONE_NIC       "netmap:foo-i"
661only the i-th hardware ring pair, where the number is in
662.Pa nr_ringid ;
663.It NR_REG_PIPE_MASTER  "netmap:foo{i"
664the master side of the netmap pipe whose identifier (i) is in
665.Pa nr_ringid ;
666.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
667the slave side of the netmap pipe whose identifier (i) is in
668.Pa nr_ringid .
669.Pp
670The identifier of a pipe must be thought as part of the pipe name,
671and does not need to be sequential.
672On return the pipe
673will only have a single ring pair with index 0,
674irrespective of the value of
675.Va i .
676.El
677.Pp
678By default, a
679.Xr poll 2
680or
681.Xr select 2
682call pushes out any pending packets on the transmit ring, even if
683no write events are specified.
684The feature can be disabled by or-ing
685.Va NETMAP_NO_TX_POLL
686to the value written to
687.Va nr_ringid .
688When this feature is used,
689packets are transmitted only on
690.Va ioctl(NIOCTXSYNC)
691or
692.Va select() /
693.Va poll()
694are called with a write event (POLLOUT/wfdset) or a full ring.
695.Pp
696When registering a virtual interface that is dynamically created to a
697.Xr vale 4
698switch, we can specify the desired number of rings (1 by default,
699and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
700.It Dv NIOCTXSYNC
701tells the hardware of new packets to transmit, and updates the
702number of slots available for transmission.
703.It Dv NIOCRXSYNC
704tells the hardware of consumed packets, and asks for newly available
705packets.
706.El
707.Sh SELECT, POLL, EPOLL, KQUEUE
708.Xr select 2
709and
710.Xr poll 2
711on a
712.Nm
713file descriptor process rings as indicated in
714.Sx TRANSMIT RINGS
715and
716.Sx RECEIVE RINGS ,
717respectively when write (POLLOUT) and read (POLLIN) events are requested.
718Both block if no slots are available in the ring
719.Va ( ring->cur == ring->tail ) .
720Depending on the platform,
721.Xr epoll 7
722and
723.Xr kqueue 2
724are supported too.
725.Pp
726Packets in transmit rings are normally pushed out
727(and buffers reclaimed) even without
728requesting write events.
729Passing the
730.Dv NETMAP_NO_TX_POLL
731flag to
732.Em NIOCREGIF
733disables this feature.
734By default, receive rings are processed only if read
735events are requested.
736Passing the
737.Dv NETMAP_DO_RX_POLL
738flag to
739.Em NIOCREGIF updates receive rings even without read events.
740Note that on
741.Xr epoll 7
742and
743.Xr kqueue 2 ,
744.Dv NETMAP_NO_TX_POLL
745and
746.Dv NETMAP_DO_RX_POLL
747only have an effect when some event is posted for the file descriptor.
748.Sh LIBRARIES
749The
750.Nm
751API is supposed to be used directly, both because of its simplicity and
752for efficient integration with applications.
753.Pp
754For convenience, the
755.In net/netmap_user.h
756header provides a few macros and functions to ease creating
757a file descriptor and doing I/O with a
758.Nm
759port.
760These are loosely modeled after the
761.Xr pcap 3
762API, to ease porting of libpcap-based applications to
763.Nm .
764To use these extra functions, programs should
765.Dl #define NETMAP_WITH_LIBS
766before
767.Dl #include <net/netmap_user.h>
768.Pp
769The following functions are available:
770.Bl -tag -width XXXXX
771.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
772similar to
773.Xr pcap_open_live 3 ,
774binds a file descriptor to a port.
775.Bl -tag -width XX
776.It Va ifname
777is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
778.Nm VALE
779port.
780.It Va req
781provides the initial values for the argument to the NIOCREGIF ioctl.
782The nm_flags and nm_ringid values are overwritten by parsing
783ifname and flags, and other fields can be overridden through
784the other two arguments.
785.It Va arg
786points to a struct nm_desc containing arguments (e.g., from a previously
787open file descriptor) that should override the defaults.
788The fields are used as described below
789.It Va flags
790can be set to a combination of the following flags:
791.Va NETMAP_NO_TX_POLL ,
792.Va NETMAP_DO_RX_POLL
793(copied into nr_ringid);
794.Va NM_OPEN_NO_MMAP
795(if arg points to the same memory region,
796avoids the mmap and uses the values from it);
797.Va NM_OPEN_IFNAME
798(ignores ifname and uses the values in arg);
799.Va NM_OPEN_ARG1 ,
800.Va NM_OPEN_ARG2 ,
801.Va NM_OPEN_ARG3
802(uses the fields from arg);
803.Va NM_OPEN_RING_CFG
804(uses the ring number and sizes from arg).
805.El
806.It Va int nm_close(struct nm_desc *d )
807closes the file descriptor, unmaps memory, frees resources.
808.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
809similar to
810.Va pcap_inject() ,
811pushes a packet to a ring, returns the size
812of the packet is successful, or 0 on error;
813.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
814similar to
815.Va pcap_dispatch() ,
816applies a callback to incoming packets
817.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
818similar to
819.Va pcap_next() ,
820fetches the next packet
821.El
822.Sh SUPPORTED DEVICES
823.Nm
824natively supports the following devices:
825.Pp
826On
827.Fx :
828.Xr cxgbe 4 ,
829.Xr em 4 ,
830.Xr iflib 4
831.Pq providing Xr igb 4 and Xr em 4 ,
832.Xr ixgbe 4 ,
833.Xr ixl 4 ,
834.Xr re 4 ,
835.Xr vtnet 4 .
836.Pp
837On Linux e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
838.Pp
839NICs without native support can still be used in
840.Nm
841mode through emulation.
842Performance is inferior to native netmap
843mode but still significantly higher than various raw socket types
844(bpf, PF_PACKET, etc.).
845Note that for slow devices (such as 1 Gbit/s and slower NICs,
846or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
847emulated and native mode will likely have similar or same throughput.
848.Pp
849When emulation is in use, packet sniffer programs such as tcpdump
850could see received packets before they are diverted by netmap.
851This behaviour is not intentional, being just an artifact of the implementation
852of emulation.
853Note that in case the netmap application subsequently moves packets received
854from the emulated adapter onto the host RX ring, the sniffer will intercept
855those packets again, since the packets are injected to the host stack as they
856were received by the network interface.
857.Pp
858Emulation is also available for devices with native netmap support,
859which can be used for testing or performance comparison.
860The sysctl variable
861.Va dev.netmap.admode
862globally controls how netmap mode is implemented.
863.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
864Some aspect of the operation of
865.Nm
866are controlled through sysctl variables on
867.Fx
868.Em ( dev.netmap.* )
869and module parameters on Linux
870.Em ( /sys/module/netmap/parameters/* ) :
871.Bl -tag -width indent
872.It Va dev.netmap.admode: 0
873Controls the use of native or emulated adapter mode.
874.Pp
8750 uses the best available option;
876.Pp
8771 forces native mode and fails if not available;
878.Pp
8792 forces emulated hence never fails.
880.It Va dev.netmap.generic_rings: 1
881Number of rings used for emulated netmap mode
882.It Va dev.netmap.generic_ringsize: 1024
883Ring size used for emulated netmap mode
884.It Va dev.netmap.generic_mit: 100000
885Controls interrupt moderation for emulated mode
886.It Va dev.netmap.mmap_unreg: 0
887.It Va dev.netmap.fwd: 0
888Forces NS_FORWARD mode
889.It Va dev.netmap.flags: 0
890.It Va dev.netmap.txsync_retry: 2
891.It Va dev.netmap.no_pendintr: 1
892Forces recovery of transmit buffers on system calls
893.It Va dev.netmap.mitigate: 1
894Propagates interrupt mitigation to user processes
895.It Va dev.netmap.no_timestamp: 0
896Disables the update of the timestamp in the netmap ring
897.It Va dev.netmap.verbose: 0
898Verbose kernel messages
899.It Va dev.netmap.buf_num: 163840
900.It Va dev.netmap.buf_size: 2048
901.It Va dev.netmap.ring_num: 200
902.It Va dev.netmap.ring_size: 36864
903.It Va dev.netmap.if_num: 100
904.It Va dev.netmap.if_size: 1024
905Sizes and number of objects (netmap_if, netmap_ring, buffers)
906for the global memory region.
907The only parameter worth modifying is
908.Va dev.netmap.buf_num
909as it impacts the total amount of memory used by netmap.
910.It Va dev.netmap.buf_curr_num: 0
911.It Va dev.netmap.buf_curr_size: 0
912.It Va dev.netmap.ring_curr_num: 0
913.It Va dev.netmap.ring_curr_size: 0
914.It Va dev.netmap.if_curr_num: 0
915.It Va dev.netmap.if_curr_size: 0
916Actual values in use.
917.It Va dev.netmap.bridge_batch: 1024
918Batch size used when moving packets across a
919.Nm VALE
920switch.
921Values above 64 generally guarantee good
922performance.
923.It Va dev.netmap.ptnet_vnet_hdr: 1
924Allow ptnet devices to use virtio-net headers
925.El
926.Sh SYSTEM CALLS
927.Nm
928uses
929.Xr select 2 ,
930.Xr poll 2 ,
931.Xr epoll 7
932and
933.Xr kqueue 2
934to wake up processes when significant events occur, and
935.Xr mmap 2
936to map memory.
937.Xr ioctl 2
938is used to configure ports and
939.Nm VALE switches .
940.Pp
941Applications may need to create threads and bind them to
942specific cores to improve performance, using standard
943OS primitives, see
944.Xr pthread 3 .
945In particular,
946.Xr pthread_setaffinity_np 3
947may be of use.
948.Sh EXAMPLES
949.Ss TEST PROGRAMS
950.Nm
951comes with a few programs that can be used for testing or
952simple applications.
953See the
954.Pa examples/
955directory in
956.Nm
957distributions, or
958.Pa tools/tools/netmap/
959directory in
960.Fx
961distributions.
962.Pp
963.Xr pkt-gen 8
964is a general purpose traffic source/sink.
965.Pp
966As an example
967.Dl pkt-gen -i ix0 -f tx -l 60
968can generate an infinite stream of minimum size packets, and
969.Dl pkt-gen -i ix0 -f rx
970is a traffic sink.
971Both print traffic statistics, to help monitor
972how the system performs.
973.Pp
974.Xr pkt-gen 8
975has many options can be uses to set packet sizes, addresses,
976rates, and use multiple send/receive threads and cores.
977.Pp
978.Xr bridge 4
979is another test program which interconnects two
980.Nm
981ports.
982It can be used for transparent forwarding between
983interfaces, as in
984.Dl bridge -i netmap:ix0 -i netmap:ix1
985or even connect the NIC to the host stack using netmap
986.Dl bridge -i netmap:ix0
987.Ss USING THE NATIVE API
988The following code implements a traffic generator
989.Pp
990.Bd -literal -compact
991#include <net/netmap_user.h>
992\&...
993void sender(void)
994{
995    struct netmap_if *nifp;
996    struct netmap_ring *ring;
997    struct nmreq nmr;
998    struct pollfd fds;
999
1000    fd = open("/dev/netmap", O_RDWR);
1001    bzero(&nmr, sizeof(nmr));
1002    strcpy(nmr.nr_name, "ix0");
1003    nmr.nm_version = NETMAP_API;
1004    ioctl(fd, NIOCREGIF, &nmr);
1005    p = mmap(0, nmr.nr_memsize, fd);
1006    nifp = NETMAP_IF(p, nmr.nr_offset);
1007    ring = NETMAP_TXRING(nifp, 0);
1008    fds.fd = fd;
1009    fds.events = POLLOUT;
1010    for (;;) {
1011	poll(&fds, 1, -1);
1012	while (!nm_ring_empty(ring)) {
1013	    i = ring->cur;
1014	    buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
1015	    ... prepare packet in buf ...
1016	    ring->slot[i].len = ... packet length ...
1017	    ring->head = ring->cur = nm_ring_next(ring, i);
1018	}
1019    }
1020}
1021.Ed
1022.Ss HELPER FUNCTIONS
1023A simple receiver can be implemented using the helper functions
1024.Bd -literal -compact
1025#define NETMAP_WITH_LIBS
1026#include <net/netmap_user.h>
1027\&...
1028void receiver(void)
1029{
1030    struct nm_desc *d;
1031    struct pollfd fds;
1032    u_char *buf;
1033    struct nm_pkthdr h;
1034    ...
1035    d = nm_open("netmap:ix0", NULL, 0, 0);
1036    fds.fd = NETMAP_FD(d);
1037    fds.events = POLLIN;
1038    for (;;) {
1039	poll(&fds, 1, -1);
1040        while ( (buf = nm_nextpkt(d, &h)) )
1041	    consume_pkt(buf, h->len);
1042    }
1043    nm_close(d);
1044}
1045.Ed
1046.Ss ZERO-COPY FORWARDING
1047Since physical interfaces share the same memory region,
1048it is possible to do packet forwarding between ports
1049swapping buffers.
1050The buffer from the transmit ring is used
1051to replenish the receive ring:
1052.Bd -literal -compact
1053    uint32_t tmp;
1054    struct netmap_slot *src, *dst;
1055    ...
1056    src = &src_ring->slot[rxr->cur];
1057    dst = &dst_ring->slot[txr->cur];
1058    tmp = dst->buf_idx;
1059    dst->buf_idx = src->buf_idx;
1060    dst->len = src->len;
1061    dst->flags = NS_BUF_CHANGED;
1062    src->buf_idx = tmp;
1063    src->flags = NS_BUF_CHANGED;
1064    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
1065    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
1066    ...
1067.Ed
1068.Ss ACCESSING THE HOST STACK
1069The host stack is for all practical purposes just a regular ring pair,
1070which you can access with the netmap API (e.g., with
1071.Dl nm_open("netmap:eth0^", ... ) ;
1072All packets that the host would send to an interface in
1073.Nm
1074mode end up into the RX ring, whereas all packets queued to the
1075TX ring are send up to the host stack.
1076.Ss VALE SWITCH
1077A simple way to test the performance of a
1078.Nm VALE
1079switch is to attach a sender and a receiver to it,
1080e.g., running the following in two different terminals:
1081.Dl pkt-gen -i vale1:a -f rx # receiver
1082.Dl pkt-gen -i vale1:b -f tx # sender
1083The same example can be used to test netmap pipes, by simply
1084changing port names, e.g.,
1085.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
1086.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
1087.Pp
1088The following command attaches an interface and the host stack
1089to a switch:
1090.Dl valectl -h vale2:em0
1091Other
1092.Nm
1093clients attached to the same switch can now communicate
1094with the network card or the host.
1095.Sh SEE ALSO
1096.Xr vale 4 ,
1097.Xr valectl 8 ,
1098.Xr bridge 8 ,
1099.Xr lb 8 ,
1100.Xr nmreplay 8 ,
1101.Xr pkt-gen 8
1102.Pp
1103.Pa http://info.iet.unipi.it/~luigi/netmap/
1104.Pp
1105Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
1106Communications of the ACM, 55 (3), pp.45-51, March 2012
1107.Pp
1108Luigi Rizzo, netmap: a novel framework for fast packet I/O,
1109Usenix ATC'12, June 2012, Boston
1110.Pp
1111Luigi Rizzo, Giuseppe Lettieri,
1112VALE, a switched ethernet for virtual machines,
1113ACM CoNEXT'12, December 2012, Nice
1114.Pp
1115Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
1116Speeding up packet I/O in virtual machines,
1117ACM/IEEE ANCS'13, October 2013, San Jose
1118.Sh AUTHORS
1119.An -nosplit
1120The
1121.Nm
1122framework has been originally designed and implemented at the
1123Universita` di Pisa in 2011 by
1124.An Luigi Rizzo ,
1125and further extended with help from
1126.An Matteo Landi ,
1127.An Gaetano Catalli ,
1128.An Giuseppe Lettieri ,
1129and
1130.An Vincenzo Maffione .
1131.Pp
1132.Nm
1133and
1134.Nm VALE
1135have been funded by the European Commission within FP7 Projects
1136CHANGE (257422) and OPENLAB (287581).
1137.Sh CAVEATS
1138No matter how fast the CPU and OS are,
1139achieving line rate on 10G and faster interfaces
1140requires hardware with sufficient performance.
1141Several NICs are unable to sustain line rate with
1142small packet sizes.
1143Insufficient PCIe or memory bandwidth
1144can also cause reduced performance.
1145.Pp
1146Another frequent reason for low performance is the use
1147of flow control on the link: a slow receiver can limit
1148the transmit speed.
1149Be sure to disable flow control when running high
1150speed experiments.
1151.Ss SPECIAL NIC FEATURES
1152.Nm
1153is orthogonal to some NIC features such as
1154multiqueue, schedulers, packet filters.
1155.Pp
1156Multiple transmit and receive rings are supported natively
1157and can be configured with ordinary OS tools,
1158such as
1159.Xr ethtool 8
1160or
1161device-specific sysctl variables.
1162The same goes for Receive Packet Steering (RPS)
1163and filtering of incoming traffic.
1164.Pp
1165.Nm
1166.Em does not use
1167features such as
1168.Em checksum offloading , TCP segmentation offloading ,
1169.Em encryption , VLAN encapsulation/decapsulation ,
1170etc.
1171When using netmap to exchange packets with the host stack,
1172make sure to disable these features.
1173