1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.\" $FreeBSD$
29.\"
30.Dd October 3, 2020
31.Dt NETMAP 4
32.Os
33.Sh NAME
34.Nm netmap
35.Nd a framework for fast packet I/O
36.Sh SYNOPSIS
37.Cd device netmap
38.Sh DESCRIPTION
39.Nm
40is a framework for extremely fast and efficient packet I/O
41for userspace and kernel clients, and for Virtual Machines.
42It runs on
43.Fx ,
44Linux and some versions of Windows, and supports a variety of
45.Nm netmap ports ,
46including
47.Bl -tag -width XXXX
48.It Nm physical NIC ports
49to access individual queues of network interfaces;
50.It Nm host ports
51to inject packets into the host stack;
52.It Nm VALE ports
53implementing a very fast and modular in-kernel software switch/dataplane;
54.It Nm netmap pipes
55a shared memory packet transport channel;
56.It Nm netmap monitors
57a mechanism similar to
58.Xr bpf 4
59to capture traffic
60.El
61.Pp
62All these
63.Nm netmap ports
64are accessed interchangeably with the same API,
65and are at least one order of magnitude faster than
66standard OS mechanisms
67(sockets, bpf, tun/tap interfaces, native switches, pipes).
68With suitably fast hardware (NICs, PCIe buses, CPUs),
69packet I/O using
70.Nm
71on supported NICs
72reaches 14.88 million packets per second (Mpps)
73with much less than one core on 10 Gbit/s NICs;
7435-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
75about 20 Mpps per core for VALE ports;
76and over 100 Mpps for
77.Nm netmap pipes .
78NICs without native
79.Nm
80support can still use the API in emulated mode,
81which uses unmodified device drivers and is 3-5 times faster than
82.Xr bpf 4
83or raw sockets.
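.Pp
The 14.88 Mpps figure quoted above is the line rate of 10 Gbit/s Ethernet
for minimum-size frames.
The sketch below reproduces that arithmetic, assuming the usual 20 bytes of
per-frame wire overhead (8 bytes of preamble and start delimiter plus a
12-byte inter-frame gap).
The names
.Fn line_rate_pps
and
.Dv ETH_WIRE_OVERHEAD
are made up for this illustration and are not part of the
.Nm
API.
.Bd -literal
```c
#include <stdint.h>

/*
 * Bytes a frame occupies on the wire beyond its own length:
 * 8 bytes of preamble/SFD plus a 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD 20

/*
 * Upper bound on frames per second for a link running at link_bps
 * bits per second, with frames of frame_len bytes (FCS included).
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_len)
{
    return link_bps / (8ULL * (frame_len + ETH_WIRE_OVERHEAD));
}
```
.Ed
.Pp
For a 64-byte frame on a 10 Gbit/s link this evaluates to
10^10 / (8 * 84), that is 14,880,952 frames per second.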
84.Pp
85Userspace clients can dynamically switch NICs into
86.Nm
87mode and send and receive raw packets through
88memory mapped buffers.
89Similarly,
90.Nm VALE
91switch instances and ports,
92.Nm netmap pipes
93and
94.Nm netmap monitors
95can be created dynamically,
96providing high speed packet I/O between processes,
97virtual machines, NICs and the host stack.
98.Pp
99.Nm
100supports both non-blocking I/O through
101.Xr ioctl 2 ,
102synchronization and blocking I/O through a file descriptor
103and standard OS mechanisms such as
104.Xr select 2 ,
105.Xr poll 2 ,
106.Xr kqueue 2
107and
108.Xr epoll 7 .
109All types of
110.Nm netmap ports
111and the
112.Nm VALE switch
113are implemented by a single kernel module, which also emulates the
114.Nm
115API over standard drivers.
116For best performance,
117.Nm
118requires native support in device drivers.
119A list of such devices is at the end of this document.
120.Pp
121In the rest of this (long) manual page we document
122various aspects of the
123.Nm
124and
125.Nm VALE
126architecture, features and usage.
127.Sh ARCHITECTURE
128.Nm
129supports raw packet I/O through a
130.Em port ,
131which can be connected to a physical interface
132.Em ( NIC ) ,
133to the host stack,
134or to a
135.Nm VALE
136switch.
137Ports use preallocated circular queues of buffers
138.Em ( rings )
139residing in an mmapped region.
140There is one ring for each transmit/receive queue of a
141NIC or virtual port.
142An additional ring pair connects to the host stack.
143.Pp
144After binding a file descriptor to a port, a
145.Nm
146client can send or receive packets in batches through
147the rings, and possibly implement zero-copy forwarding
148between ports.
149.Pp
150All NICs operating in
151.Nm
152mode use the same memory region,
153accessible to all processes who own
154.Pa /dev/netmap
155file descriptors bound to NICs.
156Independent
157.Nm VALE
158and
159.Nm netmap pipe
160ports
161by default use separate memory regions,
162but can be independently configured to share memory.
163.Sh ENTERING AND EXITING NETMAP MODE
164The following section describes the system calls to create
165and control
166.Nm netmap
167ports (including
168.Nm VALE
169and
170.Nm netmap pipe
171ports).
172Simpler, higher level functions are described in the
173.Sx LIBRARIES
174section.
175.Pp
176Ports and rings are created and controlled through a file descriptor,
177created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
179and then bound to a specific port with an
180.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
181.Pp
182.Nm
183has multiple modes of operation controlled by the
184.Vt struct nmreq
185argument.
186.Va arg.nr_name
187specifies the netmap port name, as follows:
188.Bl -tag -width XXXX
189.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
190the data path of the NIC is disconnected from the host stack,
191and the file descriptor is bound to the NIC (one or all queues),
192or to the host stack;
193.It Dv valeSSS:PPP
194the file descriptor is bound to port PPP of VALE switch SSS.
195Switch instances and ports are dynamically created if necessary.
196.Pp
197Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
198cannot exceed IFNAMSIZ characters, and PPP cannot
199be the name of any existing OS network interface.
200.El
201.Pp
202On return,
203.Va arg
204indicates the size of the shared memory region,
205and the number, size and location of all the
206.Nm
207data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
209.Pp
Non-blocking I/O is done with special
.Xr ioctl 2 .
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
216.Pp
217While a NIC is in
218.Nm
219mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
221.Nm
222ring, and another ring is used to send packets into the OS network stack.
223A
224.Xr close 2
225on the file descriptor removes the binding,
226and returns the NIC to normal mode (reconnecting the data path
227to the host stack), or destroys the virtual port.
228.Sh DATA STRUCTURES
229The data structures in the mmapped memory region are detailed in
230.In sys/net/netmap.h ,
231which is the ultimate reference for the
232.Nm
233API.
234The main structures and fields are indicated below:
235.Bl -tag -width XXX
236.It Dv struct netmap_if (one per interface )
.Bd -literal
struct netmap_if {
    ...
    const uint32_t   ni_flags;      /* properties              */
    ...
    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
    uint32_t         ni_bufs_head;  /* head of extra bufs list */
    ...
};
.Ed
248.Pp
249Indicates the number of available rings
.Pa ( struct netmap_ring )
251and their position in the mmapped region.
252The number of tx and rx rings
253.Pa ( ni_tx_rings , ni_rx_rings )
254normally depends on the hardware.
255NICs also have an extra tx/rx ring pair connected to the host stack.
256.Em NIOCREGIF
257can also request additional unbound buffers in the same memory space,
258to be used as temporary storage for packets.
259The number of extra
260buffers is specified in the
261.Va arg.nr_arg3
262field.
263On success, the kernel writes back to
264.Va arg.nr_arg3
the number of extra buffers actually allocated (there may be fewer
than requested if the memory space ran out of buffers).
267.Pa ni_bufs_head
268contains the index of the first of these extra buffers,
269which are connected in a list (the first uint32_t of each
270buffer being the index of the next buffer in the list).
271A
272.Dv 0
273indicates the end of the list.
274The application is free to modify
275this list and use the buffers (i.e., binding them to the slots of a
276netmap ring).
277When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
281.Em NIOCREGIF .
282.It Dv struct netmap_ring (one per ring )
.Bd -literal
struct netmap_ring {
    ...
    const uint32_t num_slots;   /* slots in each ring            */
    const uint32_t nr_buf_size; /* size of each buffer           */
    ...
    uint32_t       head;        /* (u) first buf owned by user   */
    uint32_t       cur;         /* (u) wakeup position           */
    const uint32_t tail;        /* (k) first buf owned by kernel */
    ...
    uint32_t       flags;
    struct timeval ts;          /* (k) time of last rxsync()     */
    ...
    struct netmap_slot slot[0]; /* array of slots                */
};
.Ed
299.Pp
300Implements transmit and receive rings, with read/write
301pointers, metadata and an array of
302.Em slots
303describing the buffers.
304.It Dv struct netmap_slot (one per buffer )
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;           /* buffer index                 */
    uint16_t len;               /* packet length                */
    uint16_t flags;             /* buf changed, etc.            */
    uint64_t ptr;               /* address for indirect buffers */
};
.Ed
313.Pp
314Describes a packet buffer, which normally is identified by
315an index and resides in the mmapped region.
316.It Dv packet buffers
317Fixed size (normally 2 KB) packet buffers allocated by the kernel.
318.El
319.Pp
320The offset of the
321.Pa struct netmap_if
322in the mmapped region is indicated by the
323.Pa nr_offset
324field in the structure returned by
325.Dv NIOCREGIF .
326From there, all other objects are reachable through
327relative references (offsets or indexes).
328Macros and functions in
329.In net/netmap_user.h
help convert them into actual pointers:
331.Pp
332.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
333.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
334.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
335.Pp
336.Dl char *buf = NETMAP_BUF(ring, buffer_index);
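.Pp
The list of extra buffers rooted at
.Pa ni_bufs_head
is an index-linked list stored inside the buffers themselves.
The sketch below walks such a list over a plain memory area standing in for
the buffer pool.
.Fn extra_bufs_count
and its arguments are hypothetical; a real program would obtain each buffer
address with
.Fn NETMAP_BUF
instead of the pointer arithmetic used here.
.Bd -literal
```c
#include <stdint.h>
#include <string.h>

/*
 * Count the buffers on an extra-buffer list.  bufs stands in for
 * the buffer area: buffer i starts at bufs + i * buf_size.
 */
static uint32_t
extra_bufs_count(const char *bufs, uint32_t buf_size, uint32_t head)
{
    uint32_t n = 0, idx = head, next;

    while (idx != 0) {      /* index 0 terminates the list */
        /* the first uint32_t of each buffer links to the next */
        memcpy(&next, bufs + (size_t)idx * buf_size, sizeof(next));
        idx = next;
        n++;
    }
    return n;
}
```
.Ed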
337.Sh RINGS, BUFFERS AND DATA I/O
338.Va Rings
339are circular queues of packets with three indexes/pointers
340.Va ( head , cur , tail ) ;
341one slot is always kept empty.
342The ring size
343.Va ( num_slots )
344should not be assumed to be a power of two.
345.Pp
346.Va head
347is the first slot available to userspace;
348.Pp
349.Va cur
350is the wakeup point:
351select/poll will unblock when
352.Va tail
353passes
354.Va cur ;
355.Pp
356.Va tail
357is the first slot reserved to the kernel.
358.Pp
359Slot indexes
360.Em must
361only move forward;
362for convenience, the function
363.Dl nm_ring_next(ring, index)
364returns the next index modulo the ring size.
365.Pp
366.Va head
367and
368.Va cur
369are only modified by the user program;
370.Va tail
371is only modified by the kernel.
372The kernel only reads/writes the
373.Vt struct netmap_ring
374slots and buffers
375during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
377.Va tail\  . . . head-1 ,
378that are explicitly assigned to the kernel.
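.Pp
Because
.Va num_slots
need not be a power of two, index arithmetic has to wrap explicitly rather
than by masking.
The following sketch models this on a reduced structure holding only the
index fields;
.Vt struct ring_model ,
.Fn ring_next
and
.Fn ring_space
are illustrative names, not the definitions found in
.In net/netmap_user.h .
.Bd -literal
```c
#include <stdint.h>

/* Reduced stand-in for struct netmap_ring: index fields only. */
struct ring_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* Next index, wrapping at the ring size (the idea of nm_ring_next()). */
static uint32_t
ring_next(const struct ring_model *r, uint32_t i)
{
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Slots userspace may still use: the distance from cur to tail. */
static uint32_t
ring_space(const struct ring_model *r)
{
    int64_t d = (int64_t)r->tail - (int64_t)r->cur;

    if (d < 0)
        d += r->num_slots;
    return (uint32_t)d;
}
```
.Ed
.Pp
With 10 slots, cur at 7 and tail at 2, the slots 7, 8, 9, 0 and 1 are
available, so the distance is 5.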
379.Ss TRANSMIT RINGS
380On transmit rings, after a
381.Nm
382system call, slots in the range
383.Va head\  . . . tail-1
384are available for transmission.
385User code should fill the slots sequentially
386and advance
387.Va head
388and
389.Va cur
390past slots ready to transmit.
391.Va cur
392may be moved further ahead if the user code needs
393more slots before further transmissions (see
394.Sx SCATTER GATHER I/O ) .
395.Pp
396At the next NIOCTXSYNC/select()/poll(),
397slots up to
398.Va head-1
399are pushed to the port, and
400.Va tail
401may advance if further slots have become available.
402Below is an example of the evolution of a TX ring:
.Bd -literal
    after the syscall, slots between cur and tail are (a)vailable
              head=cur   tail
               |          |
               v          v
     TX  [.....aaaaaaaaaaa.............]

    user creates new packets to (T)ransmit
                head=cur tail
                    |     |
                    v     v
     TX  [.....TTTTTaaaaaa.............]

    NIOCTXSYNC/poll()/select() sends packets and reports new slots
                head=cur      tail
                    |          |
                    v          v
     TX  [..........aaaaaaaaaaa........]
.Ed
422.Pp
423.Fn select
424and
425.Fn poll
426will block if there is no space in the ring, i.e.,
427.Dl ring->cur == ring->tail
428and return when new slots have become available.
429.Pp
430High speed applications may want to amortize the cost of system calls
431by preparing as many packets as possible before issuing them.
432.Pp
433A transmit ring with pending transmissions has
434.Dl ring->head != ring->tail + 1 (modulo the ring size).
435The function
436.Va int nm_tx_pending(ring)
437implements this test.
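.Pp
The transmit-side rules can be modelled without a real port.
The sketch below uses a hypothetical
.Vt struct txring_model
holding only the indexes:
.Fn tx_queue
advances
.Va head
and
.Va cur
together until it reaches
.Va tail ,
and
.Fn tx_pending
applies the test stated above.
Neither function is part of the
.Nm
API.
.Bd -literal
```c
#include <stdint.h>

/* Reduced stand-in for a transmit ring: index fields only. */
struct txring_model {
    uint32_t num_slots;
    uint32_t head, cur, tail;
};

/* 1 if some slots given to the kernel are not yet reclaimed,
 * i.e. head != tail + 1 modulo the ring size. */
static int
tx_pending(const struct txring_model *r)
{
    uint32_t after_tail;

    after_tail = (r->tail + 1 == r->num_slots) ? 0 : r->tail + 1;
    return after_tail != r->head;
}

/* Queue up to n packets; returns how many slots were used. */
static uint32_t
tx_queue(struct txring_model *r, uint32_t n)
{
    uint32_t done = 0;

    while (done < n && r->cur != r->tail) {
        /* a real program fills slot[r->cur] and its buffer here */
        r->cur = (r->cur + 1 == r->num_slots) ? 0 : r->cur + 1;
        r->head = r->cur;
        done++;
    }
    return done;
}
```
.Ed
.Pp
Queueing as many packets as the ring allows before the next system call is
what amortizes the cost of that call.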
438.Ss RECEIVE RINGS
439On receive rings, after a
440.Nm
441system call, the slots in the range
442.Va head\& . . . tail-1
443contain received packets.
444User code should process them and advance
445.Va head
446and
447.Va cur
448past slots it wants to return to the kernel.
449.Va cur
450may be moved further ahead if the user code wants to
451wait for more packets
452without returning all the previous slots to the kernel.
453.Pp
454At the next NIOCRXSYNC/select()/poll(),
455slots up to
456.Va head-1
457are returned to the kernel for further receives, and
458.Va tail
459may advance to report new incoming packets.
460.Pp
461Below is an example of the evolution of an RX ring:
.Bd -literal
    after the syscall, there are some (h)eld and some (R)eceived slots
           head  cur     tail
            |     |       |
            v     v       v
     RX  [..hhhhhhRRRRRRRR..........]

    user advances head and cur, releasing some slots and holding others
               head cur  tail
                 |  |     |
                 v  v     v
     RX  [..*****hhhRRRRRR...........]

    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
               head cur        tail
                 |  |           |
                 v  v           v
     RX  [.......hhhRRRRRRRRRRRR....]
.Ed
481.Sh SLOTS AND PACKET BUFFERS
482Normally, packets should be stored in the netmap-allocated buffers
483assigned to slots when ports are bound to a file descriptor.
484One packet is fully contained in a single buffer.
485.Pp
486The following flags affect slot and buffer processing:
487.Bl -tag -width XXX
488.It NS_BUF_CHANGED
489.Em must
490be used when the
491.Va buf_idx
492in the slot is changed.
493This can be used to implement
494zero-copy forwarding, see
495.Sx ZERO-COPY FORWARDING .
496.It NS_REPORT
497reports when this buffer has been transmitted.
498Normally,
499.Nm
500notifies transmit completions in batches, hence signals
501can be delayed indefinitely.
502This flag helps detect
503when packets have been sent and a file descriptor can be closed.
504.It NS_FORWARD
505When a ring is in 'transparent' mode,
506packets marked with this flag by the user application are forwarded to the
507other endpoint at the next system call, thus restoring (in a selective way)
508the connection between a NIC and the host stack.
509.It NS_NO_LEARN
510tells the forwarding code that the source MAC address for this
511packet must not be used in the learning bridge code.
512.It NS_INDIRECT
513indicates that the packet's payload is in a user-supplied buffer
514whose user virtual address is in the 'ptr' field of the slot.
515The size can reach 65535 bytes.
516.Pp
517This is only supported on the transmit ring of
518.Nm VALE
ports, and it helps reduce data copies in the interconnection
520of virtual machines.
521.It NS_MOREFRAG
522indicates that the packet continues with subsequent buffers;
523the last buffer in a packet must have the flag clear.
524.El
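.Pp
As an illustration of
.Dv NS_BUF_CHANGED ,
the sketch below exchanges the buffers of a receive slot and a transmit slot
so that the packet moves without being copied.
.Vt struct slot_model
and
.Dv SLOT_BUF_CHANGED
are stand-ins defined only for this example; real code uses
.Vt struct netmap_slot
and the flag value from
.In net/netmap.h .
.Bd -literal
```c
#include <stdint.h>

/* Stand-in for the first fields of struct netmap_slot. */
struct slot_model {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

#define SLOT_BUF_CHANGED 0x0001 /* illustrative value only */

/* Hand the packet in rx over to tx by exchanging buffer indexes. */
static void
slot_swap(struct slot_model *rx, struct slot_model *tx)
{
    uint32_t tmp = tx->buf_idx;

    tx->buf_idx = rx->buf_idx;
    rx->buf_idx = tmp;
    tx->len = rx->len;
    /* both slots now refer to a different buffer */
    tx->flags |= SLOT_BUF_CHANGED;
    rx->flags |= SLOT_BUF_CHANGED;
}
```
.Ed
.Pp
The receive slot gets the old transmit buffer back, so neither ring loses a
buffer.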
525.Sh SCATTER GATHER I/O
526Packets can span multiple slots if the
527.Va NS_MOREFRAG
528flag is set in all but the last slot.
529The maximum length of a chain is 64 buffers.
530This is normally used with
531.Nm VALE
532ports when connecting virtual machines, as they generate large
533TSO segments that are not split unless they reach a physical device.
534.Pp
535NOTE: The length field always refers to the individual
536fragment; there is no place with the total length of a packet.
537.Pp
538On receive rings the macro
539.Va NS_RFRAGS(slot)
540indicates the remaining number of slots for this packet,
541including the current one.
542Slots with a value greater than 1 also have NS_MOREFRAG set.
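.Pp
Since no field holds the total length of a multi-slot packet, a consumer has
to add up the fragments itself.
A minimal sketch follows, using a hypothetical
.Vt struct frag_slot
and a stand-in constant
.Dv FRAG_MORE
in place of
.Dv NS_MOREFRAG ;
.Fn packet_len
is not a
.Nm
function.
.Bd -literal
```c
#include <stdint.h>

/* Stand-in for the len and flags fields of struct netmap_slot. */
struct frag_slot {
    uint16_t len;
    uint16_t flags;
};

#define FRAG_MORE 0x0020    /* illustrative stand-in for NS_MOREFRAG */

/*
 * Total length of the packet starting at slots[first], looking at
 * no more than avail slots.  On success *nslots is the number of
 * slots used; an unterminated chain yields 0 with *nslots == 0.
 */
static uint32_t
packet_len(const struct frag_slot *slots, uint32_t num_slots,
    uint32_t first, uint32_t avail, uint32_t *nslots)
{
    uint32_t i = first, total = 0, n = 0;

    while (n < avail) {
        total += slots[i].len;  /* len covers this fragment only */
        n++;
        if (!(slots[i].flags & FRAG_MORE)) {
            *nslots = n;        /* last fragment: flag is clear */
            return total;
        }
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nslots = 0;
    return 0;
}
```
.Ed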
543.Sh IOCTLS
544.Nm
545uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
546for non-blocking I/O.
547They take no argument.
548Two more ioctls (NIOCGINFO, NIOCREGIF) are used
549to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
    uint32_t  nr_version;        /* (i) API version                */
    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
    uint16_t  nr_cmd;            /* (i) special command            */
    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
    ...
};
.Ed
569.Pp
570A file descriptor obtained through
571.Pa /dev/netmap
also supports the ioctls supported by network devices, see
573.Xr netintro 4 .
574.Bl -tag -width XXXX
575.It Dv NIOCGINFO
576returns EINVAL if the named port does not support netmap.
577Otherwise, it returns 0 and (advisory) information
578about the port.
579Note that all the information below can change before the
580interface is actually put in netmap mode.
581.Bl -tag -width XX
582.It Pa nr_memsize
583indicates the size of the
584.Nm
585memory region.
586NICs in
587.Nm
588mode all share the same memory region,
589whereas
590.Nm VALE
591ports have independent regions for each port.
592.It Pa nr_tx_slots , nr_rx_slots
593indicate the size of transmit and receive rings.
594.It Pa nr_tx_rings , nr_rx_rings
595indicate the number of transmit
596and receive rings.
597Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.,
.Xr ethtool 8 ) .
601.El
602.It Dv NIOCREGIF
603binds the port named in
604.Va nr_name
605to the file descriptor.
606For a physical device this also switches it into
607.Nm
608mode, disconnecting
609it from the host stack.
610Multiple file descriptors can be bound to the same port,
611with proper synchronization left to the user.
612.Pp
613The recommended way to bind a file descriptor to a port is
614to use function
615.Va nm_open(..)
616(see
617.Sx LIBRARIES )
618which parses names to access specific port types and
619enable features.
620In the following we document the main features.
621.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
628.Pp
629To enable this function, the
630.Pa nr_arg1
631field of the structure can be used as a hint to the kernel to
632indicate how many pipes we expect to use, and reserve extra space
633in the memory region.
634.Pp
635On return, it gives the same info as NIOCGINFO,
636with
637.Pa nr_ringid
638and
639.Pa nr_flags
640indicating the identity of the rings controlled through the file
641descriptor.
642.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
646Possible values of
647.Pa nr_flags
648are indicated below, together with the naming schemes
649that application libraries (such as the
650.Nm nm_open
651indicated below) can use to indicate the specific set of rings.
652In the example below, "netmap:foo" is any valid netmap port name.
653.Bl -tag -width XXXXX
654.It NR_REG_ALL_NIC                         "netmap:foo"
655(default) all hardware ring pairs
656.It NR_REG_SW            "netmap:foo^"
657the ``host rings'', connecting to the host stack.
658.It NR_REG_NIC_SW        "netmap:foo+"
659all hardware rings and the host rings
660.It NR_REG_ONE_NIC       "netmap:foo-i"
661only the i-th hardware ring pair, where the number is in
662.Pa nr_ringid ;
663.It NR_REG_PIPE_MASTER  "netmap:foo{i"
664the master side of the netmap pipe whose identifier (i) is in
665.Pa nr_ringid ;
666.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
667the slave side of the netmap pipe whose identifier (i) is in
668.Pa nr_ringid .
669.Pp
The identifier of a pipe must be thought of as part of the pipe name,
671and does not need to be sequential.
672On return the pipe
673will only have a single ring pair with index 0,
674irrespective of the value of
675.Va i .
676.El
677.Pp
678By default, a
679.Xr poll 2
680or
681.Xr select 2
682call pushes out any pending packets on the transmit ring, even if
683no write events are specified.
684The feature can be disabled by or-ing
685.Va NETMAP_NO_TX_POLL
686to the value written to
687.Va nr_ringid .
688When this feature is used,
packets are transmitted only when
.Va ioctl(NIOCTXSYNC)
is issued, or when
.Va select() /
.Va poll()
are called with a write event (POLLOUT/wfdset) or a full ring.
695.Pp
696When registering a virtual interface that is dynamically created to a
697.Nm VALE
698switch, we can specify the desired number of rings (1 by default,
699and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
700.It Dv NIOCTXSYNC
701tells the hardware of new packets to transmit, and updates the
702number of slots available for transmission.
703.It Dv NIOCRXSYNC
704tells the hardware of consumed packets, and asks for newly available
705packets.
706.El
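.Pp
The port naming scheme listed under
.Dv NIOCREGIF
maps a name suffix to a registration mode.
The sketch below shows one way a library could classify such names.
It is deliberately simplified:
.Vt enum reg_mode
only stands in for the
.Dv NR_REG_*
values,
.Fn port_mode
is a hypothetical helper rather than the parser used by
.Fn nm_open ,
and port names that themselves contain one of the suffix characters are not
handled.
.Bd -literal
```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the NR_REG_* values of <net/netmap.h>. */
enum reg_mode {
    REG_ALL_NIC, REG_SW, REG_NIC_SW, REG_ONE_NIC,
    REG_PIPE_MASTER, REG_PIPE_SLAVE, REG_INVALID
};

/*
 * Classify the suffix of a "netmap:foo..." name.  For the modes
 * that carry a number (ring or pipe identifier) store it in *id.
 */
static enum reg_mode
port_mode(const char *name, unsigned *id)
{
    const char *s;
    char *end;
    enum reg_mode m;

    *id = 0;
    if (strncmp(name, "netmap:", 7) != 0)
        return REG_INVALID;
    s = name + 7;
    s += strcspn(s, "^+-{}");       /* skip the port name itself */
    if (s == name + 7)
        return REG_INVALID;         /* empty port name */
    switch (*s) {
    case 0:   return REG_ALL_NIC;
    case '^': return s[1] == 0 ? REG_SW : REG_INVALID;
    case '+': return s[1] == 0 ? REG_NIC_SW : REG_INVALID;
    case '-': m = REG_ONE_NIC; break;
    case '{': m = REG_PIPE_MASTER; break;
    default:  m = REG_PIPE_SLAVE; break;    /* only '}' is left */
    }
    if (s[1] < '0' || s[1] > '9')
        return REG_INVALID;         /* a number must follow */
    *id = (unsigned)strtoul(s + 1, &end, 10);
    return *end == 0 ? m : REG_INVALID;
}
```
.Ed
.Pp
In this model the number parsed into
.Va id
plays the role of the value placed in
.Pa nr_ringid .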
707.Sh SELECT, POLL, EPOLL, KQUEUE
708.Xr select 2
709and
710.Xr poll 2
711on a
712.Nm
713file descriptor process rings as indicated in
714.Sx TRANSMIT RINGS
715and
716.Sx RECEIVE RINGS ,
717respectively when write (POLLOUT) and read (POLLIN) events are requested.
718Both block if no slots are available in the ring
719.Va ( ring->cur == ring->tail ) .
720Depending on the platform,
721.Xr epoll 7
722and
723.Xr kqueue 2
724are supported too.
725.Pp
726Packets in transmit rings are normally pushed out
727(and buffers reclaimed) even without
728requesting write events.
729Passing the
730.Dv NETMAP_NO_TX_POLL
731flag to
732.Em NIOCREGIF
733disables this feature.
734By default, receive rings are processed only if read
735events are requested.
736Passing the
737.Dv NETMAP_DO_RX_POLL
738flag to
.Em NIOCREGIF
updates receive rings even without read events.
740Note that on
741.Xr epoll 7
742and
743.Xr kqueue 2 ,
744.Dv NETMAP_NO_TX_POLL
745and
746.Dv NETMAP_DO_RX_POLL
747only have an effect when some event is posted for the file descriptor.
748.Sh LIBRARIES
749The
750.Nm
751API is supposed to be used directly, both because of its simplicity and
752for efficient integration with applications.
753.Pp
754For convenience, the
755.In net/netmap_user.h
756header provides a few macros and functions to ease creating
757a file descriptor and doing I/O with a
758.Nm
759port.
760These are loosely modeled after the
761.Xr pcap 3
762API, to ease porting of libpcap-based applications to
763.Nm .
764To use these extra functions, programs should
765.Dl #define NETMAP_WITH_LIBS
766before
767.Dl #include <net/netmap_user.h>
768.Pp
769The following functions are available:
770.Bl -tag -width XXXXX
771.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
772similar to
773.Xr pcap_open_live 3 ,
774binds a file descriptor to a port.
775.Bl -tag -width XX
776.It Va ifname
777is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
778.Nm VALE
779port.
780.It Va req
781provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
783ifname and flags, and other fields can be overridden through
784the other two arguments.
785.It Va arg
786points to a struct nm_desc containing arguments (e.g., from a previously
787open file descriptor) that should override the defaults.
The fields are used as described below.
789.It Va flags
790can be set to a combination of the following flags:
791.Va NETMAP_NO_TX_POLL ,
792.Va NETMAP_DO_RX_POLL
793(copied into nr_ringid);
794.Va NM_OPEN_NO_MMAP
795(if arg points to the same memory region,
796avoids the mmap and uses the values from it);
797.Va NM_OPEN_IFNAME
798(ignores ifname and uses the values in arg);
799.Va NM_OPEN_ARG1 ,
800.Va NM_OPEN_ARG2 ,
801.Va NM_OPEN_ARG3
802(uses the fields from arg);
803.Va NM_OPEN_RING_CFG
804(uses the ring number and sizes from arg).
805.El
806.It Va int nm_close(struct nm_desc *d )
807closes the file descriptor, unmaps memory, frees resources.
808.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
809similar to
810.Va pcap_inject() ,
pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
813.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
814similar to
815.Va pcap_dispatch() ,
816applies a callback to incoming packets
817.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
818similar to
819.Va pcap_next() ,
820fetches the next packet
821.El
822.Sh SUPPORTED DEVICES
823.Nm
824natively supports the following devices:
825.Pp
826On
827.Fx :
828.Xr cxgbe 4 ,
829.Xr em 4 ,
830.Xr iflib 4
831.Pq providing Xr igb 4 and Xr em 4 ,
832.Xr ixgbe 4 ,
833.Xr ixl 4 ,
834.Xr re 4 ,
835.Xr vtnet 4 .
836.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
838.Pp
839NICs without native support can still be used in
840.Nm
841mode through emulation.
842Performance is inferior to native netmap
843mode but still significantly higher than various raw socket types
844(bpf, PF_PACKET, etc.).
845Note that for slow devices (such as 1 Gbit/s and slower NICs,
846or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
847emulated and native mode will likely have similar or same throughput.
848.Pp
849When emulation is in use, packet sniffer programs such as tcpdump
850could see received packets before they are diverted by netmap.
851This behaviour is not intentional, being just an artifact of the implementation
852of emulation.
853Note that in case the netmap application subsequently moves packets received
854from the emulated adapter onto the host RX ring, the sniffer will intercept
855those packets again, since the packets are injected to the host stack as they
856were received by the network interface.
857.Pp
858Emulation is also available for devices with native netmap support,
859which can be used for testing or performance comparison.
860The sysctl variable
861.Va dev.netmap.admode
862globally controls how netmap mode is implemented.
863.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
864Some aspects of the operation of
865.Nm
866and
867.Nm VALE
868are controlled through sysctl variables on
869.Fx
870.Em ( dev.netmap.* )
871and module parameters on Linux
872.Em ( /sys/module/netmap/parameters/* ) :
873.Bl -tag -width indent
874.It Va dev.netmap.admode: 0
875Controls the use of native or emulated adapter mode.
876.Pp
8770 uses the best available option;
878.Pp
8791 forces native mode and fails if not available;
880.Pp
2 forces emulated mode, hence never fails.
882.It Va dev.netmap.generic_rings: 1
883Number of rings used for emulated netmap mode
884.It Va dev.netmap.generic_ringsize: 1024
885Ring size used for emulated netmap mode
886.It Va dev.netmap.generic_mit: 100000
887Controls interrupt moderation for emulated mode
888.It Va dev.netmap.fwd: 0
889Forces NS_FORWARD mode
890.It Va dev.netmap.txsync_retry: 2
891Number of txsync loops in the
892.Nm VALE
893flush function
894.It Va dev.netmap.no_pendintr: 1
895Forces recovery of transmit buffers on system calls
896.It Va dev.netmap.no_timestamp: 0
897Disables the update of the timestamp in the netmap ring
898.It Va dev.netmap.verbose: 0
899Verbose kernel messages
900.It Va dev.netmap.buf_num: 163840
901.It Va dev.netmap.buf_size: 2048
902.It Va dev.netmap.ring_num: 200
903.It Va dev.netmap.ring_size: 36864
904.It Va dev.netmap.if_num: 100
905.It Va dev.netmap.if_size: 1024
906Sizes and number of objects (netmap_if, netmap_ring, buffers)
907for the global memory region.
908The only parameter worth modifying is
909.Va dev.netmap.buf_num
910as it impacts the total amount of memory used by netmap.
911.It Va dev.netmap.buf_curr_num: 0
912.It Va dev.netmap.buf_curr_size: 0
913.It Va dev.netmap.ring_curr_num: 0
914.It Va dev.netmap.ring_curr_size: 0
915.It Va dev.netmap.if_curr_num: 0
916.It Va dev.netmap.if_curr_size: 0
917Actual values in use.
918.It Va dev.netmap.priv_buf_num: 4098
919.It Va dev.netmap.priv_buf_size: 2048
920.It Va dev.netmap.priv_ring_num: 4
921.It Va dev.netmap.priv_ring_size: 20480
922.It Va dev.netmap.priv_if_num: 2
923.It Va dev.netmap.priv_if_size: 1024
924Sizes and number of objects (netmap_if, netmap_ring, buffers)
925for private memory regions.
926A separate memory region is used for each
927.Nm VALE
928port and each pair of
929.Nm netmap pipes .
930.It Va dev.netmap.bridge_batch: 1024
931Batch size used when moving packets across a
932.Nm VALE
933switch.
934Values above 64 generally guarantee good
935performance.
936.It Va dev.netmap.ptnet_vnet_hdr: 1
937Allow ptnet devices to use virtio-net headers
938.El
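.Pp
To see why
.Va dev.netmap.buf_num
dominates the memory used by the global region, the pool sizes can simply be
multiplied out.
The sketch below does so;
.Fn region_bytes
is a hypothetical helper, and the result is only an estimate because the
allocator may round sizes or add overhead of its own.
.Bd -literal
```c
#include <stdint.h>

/*
 * Rough size in bytes of a memory region made of three pools:
 * netmap_if objects, netmap_ring objects and packet buffers.
 */
static uint64_t
region_bytes(uint64_t if_num, uint64_t if_size,
    uint64_t ring_num, uint64_t ring_size,
    uint64_t buf_num, uint64_t buf_size)
{
    return if_num * if_size +
        ring_num * ring_size +
        buf_num * buf_size;
}
```
.Ed
.Pp
With the default values listed above, the buffers account for
163840 * 2048 = 335,544,320 bytes (320 MiB) out of an estimated total of
343,019,520 bytes.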
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 7
and
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
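.Pp
The following is a minimal sketch of thread pinning with
.Xr pthread_setaffinity_np 3 .
It assumes either
.Fx
(where the CPU set type is cpuset_t) or Linux with glibc (cpu_set_t);
the helper name and the example_cpuset_t typedef are hypothetical.
To stay valid inside restricted CPU sets, it pins the calling thread to
the first CPU the thread is already allowed to use instead of
hard-coding a CPU number.

```c
#define _GNU_SOURCE		/* exposes cpu_set_t and the *_np calls on Linux */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#include <pthread_np.h>
typedef cpuset_t example_cpuset_t;
#else
typedef cpu_set_t example_cpuset_t;
#endif

/*
 * Hypothetical helper: pin the calling thread to the first CPU it is
 * currently allowed to run on.  Returns that CPU index, or -1 on error.
 */
static int
pin_self_to_first_allowed_cpu(void)
{
	example_cpuset_t allowed, one;
	int cpu;

	if (pthread_getaffinity_np(pthread_self(), sizeof(allowed),
	    &allowed) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		if (pthread_setaffinity_np(pthread_self(), sizeof(one),
		    &one) != 0)
			return -1;
		return cpu;
	}
	return -1;
}
```

A worker thread would typically call such a helper once at startup,
before entering its packet processing loop.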
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Pa examples/
directory in
.Nm
distributions, or the
.Pa tools/tools/netmap/
directory in
.Fx
distributions.
.Pp
.Xr pkt-gen 8
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge 8
is another test program which interconnects two
.Nm
ports.
It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i netmap:ix0 -i netmap:ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i netmap:ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator
(error checking is omitted for brevity):
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&nmr, 0, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);		/* bind fd to ix0 */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);	/* first TX ring */
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);		/* wait for free TX slots */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
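.Pp
The transmit loop above relies on circular index arithmetic.
The sketch below models it without any
.Nm
header: struct toy_ring and the toy_ring_*() functions are hypothetical
stand-ins that mirror only the index fields, and they assume that the
slots from head up to, but not including, tail belong to the
application.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for struct netmap_ring: only the index fields are modeled. */
struct toy_ring {
	uint32_t num_slots;	/* number of slots in the ring */
	uint32_t head;		/* first slot owned by the application */
	uint32_t cur;		/* wakeup point, kept equal to head here */
	uint32_t tail;		/* first slot owned by the kernel */
};

/* Index following i, wrapping to 0 at the end of the ring. */
static uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
	assert(i < r->num_slots);
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots between head (included) and tail (excluded). */
static uint32_t
toy_ring_space(const struct toy_ring *r)
{
	int64_t n = (int64_t)r->tail - (int64_t)r->head;

	if (n < 0)
		n += r->num_slots;
	return (uint32_t)n;
}

/* Consume every available slot, as the sender loop does; returns the count. */
static uint32_t
toy_ring_drain(struct toy_ring *r)
{
	uint32_t sent = 0;

	while (r->head != r->tail) {
		/* a real sender would fill the slot at index r->cur here */
		r->head = r->cur = toy_ring_next(r, r->cur);
		sent++;
	}
	return sent;
}
```

For a ring of 8 slots with head at 6 and tail at 2, the loop visits
slots 6, 7, 0 and 1 and stops when head reaches tail.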
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions:
.Pp
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ((buf = nm_nextpkt(d, &h)))
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
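.Pp
The receiver above calls consume_pkt(), which the example leaves
undefined.
One possible body, given here only as a sketch, classifies frames by
EtherType.
The counters, the frame_ethertype() helper and the EX_ constants are
hypothetical names chosen for this illustration; the only facts relied
upon are the 14-byte Ethernet header layout and the IPv4 EtherType
0x0800.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ETH_HDR_LEN		14	/* dst MAC (6) + src MAC (6) + EtherType (2) */
#define EX_ETHERTYPE_IPV4	0x0800

/* Hypothetical counters updated by consume_pkt(). */
static unsigned long ipv4_frames, other_frames, short_frames;

/* Read the EtherType; returns -1 if the frame is shorter than the header. */
static int
frame_ethertype(const unsigned char *buf, size_t len)
{
	if (len < EX_ETH_HDR_LEN)
		return -1;
	return (buf[12] << 8) | buf[13];	/* network byte order */
}

/* Classify one received frame and update the counters. */
static void
consume_pkt(const unsigned char *buf, size_t len)
{
	int type = frame_ethertype(buf, len);

	if (type < 0)
		short_frames++;
	else if (type == EX_ETHERTYPE_IPV4)
		ipv4_frames++;
	else
		other_frames++;
}
```

The function only reads the buffer, so it can run directly on the
memory returned by nm_nextpkt() without copying the packet.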
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
by swapping buffers.
The buffer from the transmit ring is used
to replenish the receive ring:
.Pp
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
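.Pp
The swap itself can be exercised in isolation.
In the sketch below, struct toy_slot and TOY_NS_BUF_CHANGED are
hypothetical stand-ins for struct netmap_slot and NS_BUF_CHANGED, so
that the code compiles without the
.Nm
headers; the flag value is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NS_BUF_CHANGED	0x0001	/* stand-in flag, value is arbitrary */

/* Toy stand-in for struct netmap_slot with only the fields used above. */
struct toy_slot {
	uint32_t buf_idx;	/* index of the buffer attached to the slot */
	uint16_t len;		/* length of the packet in the buffer */
	uint16_t flags;
};

/* Move the packet from src to dst by exchanging buffer indices. */
static void
toy_slot_swap(struct toy_slot *src, struct toy_slot *dst)
{
	uint32_t tmp;

	assert(src != NULL && dst != NULL);
	tmp = dst->buf_idx;
	dst->buf_idx = src->buf_idx;	/* TX slot now points at the packet */
	dst->len = src->len;
	dst->flags = TOY_NS_BUF_CHANGED;
	src->buf_idx = tmp;		/* RX slot gets the spare buffer */
	src->flags = TOY_NS_BUF_CHANGED;
}
```

No payload bytes are touched: only the two 32-bit buffer indices and
the length are updated, which is why the cost does not depend on the
packet size.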
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g., running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.,
.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl valectl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Sh SEE ALSO
.Xr vale 4 ,
.Xr valectl 8 ,
.Xr bridge 8 ,
.Xr lb 8 ,
.Xr nmreplay 8 ,
.Xr pkt-gen 8
.Pp
.Pa http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
were funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes.
Insufficient PCIe or memory bandwidth
can also cause reduced performance.
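.Pp
To put line rate in numbers, the sketch below computes the maximum
Ethernet packet rate for a given link speed and frame length.
It assumes that the length excludes the 4-byte CRC and that every frame
also occupies 8 bytes of preamble and start-of-frame delimiter plus a
12-byte inter-frame gap on the wire.
The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-frame overhead on the wire, in bytes: 7 preamble + 1 start-of-frame
 * delimiter + 4 CRC + 12 inter-frame gap.
 */
#define WIRE_OVERHEAD	24

/* Packets per second at link_bps, for frames of len bytes excluding the CRC. */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t len)
{
	uint64_t bits_per_frame = ((uint64_t)len + WIRE_OVERHEAD) * 8;

	return link_bps / bits_per_frame;
}
```

For 60-byte frames on a 10 Gbit/s link this gives about 14.88 million
packets per second, against roughly 0.81 million for 1514-byte frames,
which is why small packets are the hard case.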
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool 8
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc.
When using netmap to exchange packets with the host stack,
make sure to disable these features.
1188