1.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2.\" All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\"
13.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
23.\" SUCH DAMAGE.
24.\"
25.\" This document is derived in part from the enet man page (enet.4)
26.\" distributed with 4.3BSD Unix.
27.\"
28.Dd March 6, 2022
29.Dt NETMAP 4
30.Os
31.Sh NAME
32.Nm netmap
33.Nd a framework for fast packet I/O
34.Sh SYNOPSIS
35.Cd device netmap
36.Sh DESCRIPTION
37.Nm
38is a framework for extremely fast and efficient packet I/O
39for userspace and kernel clients, and for Virtual Machines.
40It runs on
41.Fx ,
42Linux and some versions of Windows, and supports a variety of
43.Nm netmap ports ,
44including
45.Bl -tag -width XXXX
46.It Nm physical NIC ports
47to access individual queues of network interfaces;
48.It Nm host ports
49to inject packets into the host stack;
50.It Nm VALE ports
51implementing a very fast and modular in-kernel software switch/dataplane;
52.It Nm netmap pipes
53a shared memory packet transport channel;
54.It Nm netmap monitors
55a mechanism similar to
56.Xr bpf 4
57to capture traffic
58.El
59.Pp
60All these
61.Nm netmap ports
62are accessed interchangeably with the same API,
63and are at least one order of magnitude faster than
64standard OS mechanisms
65(sockets, bpf, tun/tap interfaces, native switches, pipes).
66With suitably fast hardware (NICs, PCIe buses, CPUs),
67packet I/O using
68.Nm
69on supported NICs
70reaches 14.88 million packets per second (Mpps)
71with much less than one core on 10 Gbit/s NICs;
7235-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
73about 20 Mpps per core for VALE ports;
74and over 100 Mpps for
75.Nm netmap pipes .
76NICs without native
77.Nm
78support can still use the API in emulated mode,
79which uses unmodified device drivers and is 3-5 times faster than
80.Xr bpf 4
81or raw sockets.
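.Pp
The 14.88 Mpps figure is the line rate of 10 Gbit/s Ethernet for
minimum-size frames.
As a worked check, the sketch below assumes standard Ethernet framing:
a 64-byte frame plus 8 bytes of preamble and 12 bytes of inter-frame gap.
The function name
.Fn line_rate_pps
is illustrative and is not part of
.Nm .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 8-byte preamble/SFD + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Packets per second at full line rate for a given link speed (bit/s)
 * and frame size (bytes, including the 4-byte FCS).
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8);
}
```
.Ed
.Pp
For a 10 Gbit/s link and 64-byte frames this gives 14880952 packets
per second, i.e. 14.88 Mpps.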
82.Pp
83Userspace clients can dynamically switch NICs into
84.Nm
85mode and send and receive raw packets through
86memory mapped buffers.
87Similarly,
88.Nm VALE
89switch instances and ports,
90.Nm netmap pipes
91and
92.Nm netmap monitors
93can be created dynamically,
94providing high speed packet I/O between processes,
95virtual machines, NICs and the host stack.
96.Pp
97.Nm
98supports both non-blocking I/O through
99.Xr ioctl 2 ,
100synchronization and blocking I/O through a file descriptor
101and standard OS mechanisms such as
102.Xr select 2 ,
103.Xr poll 2 ,
104.Xr kqueue 2
105and
106.Xr epoll 7 .
107All types of
108.Nm netmap ports
109and the
110.Nm VALE switch
111are implemented by a single kernel module, which also emulates the
112.Nm
113API over standard drivers.
114For best performance,
115.Nm
116requires native support in device drivers.
117A list of such devices is at the end of this document.
118.Pp
119In the rest of this (long) manual page we document
120various aspects of the
121.Nm
122and
123.Nm VALE
124architecture, features and usage.
125.Sh ARCHITECTURE
126.Nm
127supports raw packet I/O through a
128.Em port ,
129which can be connected to a physical interface
130.Em ( NIC ) ,
131to the host stack,
132or to a
133.Nm VALE
134switch.
135Ports use preallocated circular queues of buffers
136.Em ( rings )
137residing in an mmapped region.
138There is one ring for each transmit/receive queue of a
139NIC or virtual port.
140An additional ring pair connects to the host stack.
141.Pp
142After binding a file descriptor to a port, a
143.Nm
144client can send or receive packets in batches through
145the rings, and possibly implement zero-copy forwarding
146between ports.
147.Pp
148All NICs operating in
149.Nm
150mode use the same memory region,
151accessible to all processes who own
152.Pa /dev/netmap
153file descriptors bound to NICs.
154Independent
155.Nm VALE
156and
157.Nm netmap pipe
158ports
159by default use separate memory regions,
160but can be independently configured to share memory.
161.Sh ENTERING AND EXITING NETMAP MODE
162The following section describes the system calls to create
163and control
164.Nm netmap
165ports (including
166.Nm VALE
167and
168.Nm netmap pipe
169ports).
170Simpler, higher level functions are described in the
171.Sx LIBRARIES
172section.
173.Pp
174Ports and rings are created and controlled through a file descriptor,
175created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
177and then bound to a specific port with an
178.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
179.Pp
180.Nm
181has multiple modes of operation controlled by the
182.Vt struct nmreq
183argument.
184.Va arg.nr_name
185specifies the netmap port name, as follows:
186.Bl -tag -width XXXX
187.It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
188the data path of the NIC is disconnected from the host stack,
189and the file descriptor is bound to the NIC (one or all queues),
190or to the host stack;
191.It Dv valeSSS:PPP
192the file descriptor is bound to port PPP of VALE switch SSS.
193Switch instances and ports are dynamically created if necessary.
194.Pp
195Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
196cannot exceed IFNAMSIZ characters, and PPP cannot
197be the name of any existing OS network interface.
198.El
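.Pp
The naming rule for
.Nm VALE
ports can be checked before issuing
.Dv NIOCREGIF .
The following is a minimal sketch, not code from
.Nm :
the helper names and the
.Dv SKETCH_IFNAMSIZ
constant (assumed to be 16) are illustrative, and the check that PPP is not
the name of an existing OS interface is left out.
.Bd -literal
```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_IFNAMSIZ 16	/* assumption: value of IFNAMSIZ */

/* True if s[0..len) is non-empty and matches [0-9a-zA-Z_]+ */
static bool
vale_token_ok(const char *s, size_t len)
{
	if (len == 0)
		return false;
	for (size_t i = 0; i < len; i++)
		if (!isalnum((unsigned char)s[i]) && s[i] != '_')
			return false;
	return true;
}

/* Syntax check for "valeSSS:PPP"; OS interface names are not consulted. */
static bool
vale_name_ok(const char *name)
{
	const char *colon;

	assert(name != NULL);
	if (strlen(name) >= SKETCH_IFNAMSIZ)	/* must fit nr_name with its NUL */
		return false;
	if (strncmp(name, "vale", 4) != 0)
		return false;
	colon = strchr(name, ':');
	if (colon == NULL)
		return false;
	return vale_token_ok(name + 4, (size_t)(colon - (name + 4))) &&
	    vale_token_ok(colon + 1, strlen(colon + 1));
}
```
.Ed
.Pp
Under these assumptions "vale0:p1" is accepted, while "vale0", "vale0:"
and "vale0:p-1" are rejected.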
199.Pp
200On return,
201.Va arg
202indicates the size of the shared memory region,
203and the number, size and location of all the
204.Nm
205data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
207.Pp
Non-blocking I/O is done with special
.Xr ioctl 2
calls;
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
214.Pp
215While a NIC is in
216.Nm
217mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
219.Nm
220ring, and another ring is used to send packets into the OS network stack.
221A
222.Xr close 2
223on the file descriptor removes the binding,
224and returns the NIC to normal mode (reconnecting the data path
225to the host stack), or destroys the virtual port.
226.Sh DATA STRUCTURES
227The data structures in the mmapped memory region are detailed in
228.In sys/net/netmap.h ,
229which is the ultimate reference for the
230.Nm
231API.
232The main structures and fields are indicated below:
233.Bl -tag -width XXX
234.It Dv struct netmap_if (one per interface )
235.Bd -literal
236struct netmap_if {
237    ...
238    const uint32_t   ni_flags;      /* properties              */
239    ...
240    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
241    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
242    uint32_t         ni_bufs_head;  /* head of extra bufs list */
243    ...
244};
245.Ed
246.Pp
247Indicates the number of available rings
.Pa ( struct netmap_ring )
249and their position in the mmapped region.
250The number of tx and rx rings
251.Pa ( ni_tx_rings , ni_rx_rings )
252normally depends on the hardware.
253NICs also have an extra tx/rx ring pair connected to the host stack.
254.Em NIOCREGIF
255can also request additional unbound buffers in the same memory space,
256to be used as temporary storage for packets.
257The number of extra
258buffers is specified in the
259.Va arg.nr_arg3
260field.
261On success, the kernel writes back to
262.Va arg.nr_arg3
the number of extra buffers actually allocated (possibly fewer
than requested if the memory space ran out of buffers).
265.Pa ni_bufs_head
266contains the index of the first of these extra buffers,
267which are connected in a list (the first uint32_t of each
268buffer being the index of the next buffer in the list).
269A
270.Dv 0
271indicates the end of the list.
272The application is free to modify
273this list and use the buffers (i.e., binding them to the slots of a
274netmap ring).
When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
.Pa ni_bufs_head ,
irrespective of the buffers originally provided by the kernel on
.Em NIOCREGIF .
280.It Dv struct netmap_ring (one per ring )
281.Bd -literal
282struct netmap_ring {
283    ...
284    const uint32_t num_slots;   /* slots in each ring            */
285    const uint32_t nr_buf_size; /* size of each buffer           */
286    ...
287    uint32_t       head;        /* (u) first buf owned by user   */
288    uint32_t       cur;         /* (u) wakeup position           */
289    const uint32_t tail;        /* (k) first buf owned by kernel */
290    ...
291    uint32_t       flags;
292    struct timeval ts;          /* (k) time of last rxsync()     */
293    ...
294    struct netmap_slot slot[0]; /* array of slots                */
};
296.Ed
297.Pp
298Implements transmit and receive rings, with read/write
299pointers, metadata and an array of
300.Em slots
301describing the buffers.
302.It Dv struct netmap_slot (one per buffer )
303.Bd -literal
304struct netmap_slot {
305    uint32_t buf_idx;           /* buffer index                 */
306    uint16_t len;               /* packet length                */
307    uint16_t flags;             /* buf changed, etc.            */
308    uint64_t ptr;               /* address for indirect buffers */
309};
310.Ed
311.Pp
312Describes a packet buffer, which normally is identified by
313an index and resides in the mmapped region.
314.It Dv packet buffers
315Fixed size (normally 2 KB) packet buffers allocated by the kernel.
316.El
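.Pp
The list of extra buffers rooted at
.Pa ni_bufs_head
can be handled as a simple stack.
The sketch below is a self-contained model: it uses a flat pool and a local
.Fn buf_addr
helper in place of
.Fn NETMAP_BUF ,
and the function names are illustrative.
It relies only on the rules stated above: the first uint32_t of each
buffer holds the index of the next one, and index 0 ends the list.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048u	/* the "normally 2 KB" buffer size */

/* Stand-in for NETMAP_BUF(): address of buffer idx in a flat pool. */
static char *
buf_addr(char *pool, uint32_t idx)
{
	return pool + (size_t)idx * BUF_SIZE;
}

/* Pop the first extra buffer off the list; returns 0 if the list is empty. */
static uint32_t
extra_buf_pop(char *pool, uint32_t *head)
{
	uint32_t idx = *head;

	if (idx != 0)
		memcpy(head, buf_addr(pool, idx), sizeof(*head));
	return idx;
}

/* Push buffer idx back on the list, e.g. before closing the descriptor. */
static void
extra_buf_push(char *pool, uint32_t *head, uint32_t idx)
{
	assert(idx != 0);	/* index 0 is the end-of-list marker */
	memcpy(buf_addr(pool, idx), head, sizeof(*head));
	*head = idx;
}
```
.Ed
.Pp
In a real program
.Va head
would be
.Pa nifp->ni_bufs_head ,
and buffers popped from the list should be pushed back before the
file descriptor is closed so that the kernel can free them.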
317.Pp
318The offset of the
319.Pa struct netmap_if
320in the mmapped region is indicated by the
321.Pa nr_offset
322field in the structure returned by
323.Dv NIOCREGIF .
324From there, all other objects are reachable through
325relative references (offsets or indexes).
326Macros and functions in
327.In net/netmap_user.h
328help converting them into actual pointers:
329.Pp
330.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
331.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
332.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
333.Pp
334.Dl char *buf = NETMAP_BUF(ring, buffer_index);
335.Sh RINGS, BUFFERS AND DATA I/O
336.Va Rings
337are circular queues of packets with three indexes/pointers
338.Va ( head , cur , tail ) ;
339one slot is always kept empty.
340The ring size
341.Va ( num_slots )
342should not be assumed to be a power of two.
343.Pp
344.Va head
345is the first slot available to userspace;
346.Pp
347.Va cur
348is the wakeup point:
349select/poll will unblock when
350.Va tail
351passes
352.Va cur ;
353.Pp
354.Va tail
355is the first slot reserved to the kernel.
356.Pp
357Slot indexes
358.Em must
359only move forward;
360for convenience, the function
361.Dl nm_ring_next(ring, index)
362returns the next index modulo the ring size.
363.Pp
364.Va head
365and
366.Va cur
367are only modified by the user program;
368.Va tail
369is only modified by the kernel.
370The kernel only reads/writes the
371.Vt struct netmap_ring
372slots and buffers
373during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
375.Va tail\  . . . head-1 ,
376that are explicitly assigned to the kernel.
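.Pp
The index rules above can be sketched with a small local model.
The
.Vt struct ring_model
type and the helper names are illustrative and only mirror the
.Va head ,
.Va cur ,
.Va tail
and
.Va num_slots
fields; real code should use the helpers in
.In net/netmap_user.h .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Local model of the ring indexes; struct netmap_ring has more fields. */
struct ring_model {
	uint32_t num_slots;
	uint32_t head, cur, tail;
};

/* Next index modulo the ring size, without assuming a power of two. */
static uint32_t
ring_next(const struct ring_model *r, uint32_t i)
{
	assert(i < r->num_slots);
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in head ... tail-1, i.e. owned by userspace. */
static uint32_t
ring_user_slots(const struct ring_model *r)
{
	return (r->tail >= r->head) ? r->tail - r->head
	    : r->tail + r->num_slots - r->head;
}

/* select()/poll() would block: tail has not passed cur. */
static int
ring_would_block(const struct ring_model *r)
{
	return r->cur == r->tail;
}
```
.Ed
.Pp
With 10 slots, head at 8 and tail at 3, the range wraps around and
userspace owns 5 slots (8, 9, 0, 1, 2).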
377.Ss TRANSMIT RINGS
378On transmit rings, after a
379.Nm
380system call, slots in the range
381.Va head\  . . . tail-1
382are available for transmission.
383User code should fill the slots sequentially
384and advance
385.Va head
386and
387.Va cur
388past slots ready to transmit.
389.Va cur
390may be moved further ahead if the user code needs
391more slots before further transmissions (see
392.Sx SCATTER GATHER I/O ) .
393.Pp
394At the next NIOCTXSYNC/select()/poll(),
395slots up to
396.Va head-1
397are pushed to the port, and
398.Va tail
399may advance if further slots have become available.
400Below is an example of the evolution of a TX ring:
401.Bd -literal
402    after the syscall, slots between cur and tail are (a)vailable
403              head=cur   tail
404               |          |
405               v          v
406     TX  [.....aaaaaaaaaaa.............]
407
408    user creates new packets to (T)ransmit
409                head=cur tail
410                    |     |
411                    v     v
412     TX  [.....TTTTTaaaaaa.............]
413
414    NIOCTXSYNC/poll()/select() sends packets and reports new slots
415                head=cur      tail
416                    |          |
417                    v          v
418     TX  [..........aaaaaaaaaaa........]
419.Ed
420.Pp
421.Fn select
422and
423.Fn poll
424will block if there is no space in the ring, i.e.,
425.Dl ring->cur == ring->tail
426and return when new slots have become available.
427.Pp
428High speed applications may want to amortize the cost of system calls
429by preparing as many packets as possible before issuing them.
430.Pp
431A transmit ring with pending transmissions has
432.Dl ring->head != ring->tail + 1 (modulo the ring size).
433The function
434.Va int nm_tx_pending(ring)
435implements this test.
436.Ss RECEIVE RINGS
437On receive rings, after a
438.Nm
439system call, the slots in the range
440.Va head\& . . . tail-1
441contain received packets.
442User code should process them and advance
443.Va head
444and
445.Va cur
446past slots it wants to return to the kernel.
447.Va cur
448may be moved further ahead if the user code wants to
449wait for more packets
450without returning all the previous slots to the kernel.
451.Pp
452At the next NIOCRXSYNC/select()/poll(),
453slots up to
454.Va head-1
455are returned to the kernel for further receives, and
456.Va tail
457may advance to report new incoming packets.
458.Pp
459Below is an example of the evolution of an RX ring:
460.Bd -literal
461    after the syscall, there are some (h)eld and some (R)eceived slots
462           head  cur     tail
463            |     |       |
464            v     v       v
465     RX  [..hhhhhhRRRRRRRR..........]
466
467    user advances head and cur, releasing some slots and holding others
468               head cur  tail
469                 |  |     |
470                 v  v     v
471     RX  [..*****hhhRRRRRR...........]
472
    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
474               head cur        tail
475                 |  |           |
476                 v  v           v
477     RX  [.......hhhRRRRRRRRRRRR....]
478.Ed
479.Sh SLOTS AND PACKET BUFFERS
480Normally, packets should be stored in the netmap-allocated buffers
481assigned to slots when ports are bound to a file descriptor.
482One packet is fully contained in a single buffer.
483.Pp
484The following flags affect slot and buffer processing:
485.Bl -tag -width XXX
486.It NS_BUF_CHANGED
487.Em must
488be used when the
489.Va buf_idx
490in the slot is changed.
491This can be used to implement
492zero-copy forwarding, see
493.Sx ZERO-COPY FORWARDING .
494.It NS_REPORT
495reports when this buffer has been transmitted.
496Normally,
497.Nm
498notifies transmit completions in batches, hence signals
499can be delayed indefinitely.
500This flag helps detect
501when packets have been sent and a file descriptor can be closed.
502.It NS_FORWARD
503When a ring is in 'transparent' mode,
504packets marked with this flag by the user application are forwarded to the
505other endpoint at the next system call, thus restoring (in a selective way)
506the connection between a NIC and the host stack.
507.It NS_NO_LEARN
508tells the forwarding code that the source MAC address for this
509packet must not be used in the learning bridge code.
510.It NS_INDIRECT
511indicates that the packet's payload is in a user-supplied buffer
512whose user virtual address is in the 'ptr' field of the slot.
513The size can reach 65535 bytes.
514.Pp
515This is only supported on the transmit ring of
516.Nm VALE
ports, and it helps reduce data copies in the interconnection
518of virtual machines.
519.It NS_MOREFRAG
520indicates that the packet continues with subsequent buffers;
521the last buffer in a packet must have the flag clear.
522.El
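.Pp
As a sketch of the
.Dv NS_BUF_CHANGED
rule, the following moves a packet between two slots by exchanging
buffer indexes instead of copying the payload.
.Vt struct slot_model
mirrors three fields of
.Vt struct netmap_slot ,
and
.Dv SKETCH_NS_BUF_CHANGED
is a placeholder value; real code should take both from
.In net/netmap.h .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NS_BUF_CHANGED 0x0001	/* placeholder for NS_BUF_CHANGED */

struct slot_model {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/* Hand the rx buffer to the tx slot and give the old tx buffer back. */
static void
slot_swap(struct slot_model *rx, struct slot_model *tx)
{
	uint32_t tmp;

	assert(rx != tx);
	tmp = tx->buf_idx;
	tx->buf_idx = rx->buf_idx;
	rx->buf_idx = tmp;
	tx->len = rx->len;
	/* both slots now refer to a different buffer: tell the kernel */
	tx->flags |= SKETCH_NS_BUF_CHANGED;
	rx->flags |= SKETCH_NS_BUF_CHANGED;
}
```
.Ed
.Pp
Swapping, rather than overwriting, keeps every buffer attached to exactly
one slot.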
523.Sh SCATTER GATHER I/O
524Packets can span multiple slots if the
525.Va NS_MOREFRAG
526flag is set in all but the last slot.
527The maximum length of a chain is 64 buffers.
528This is normally used with
529.Nm VALE
530ports when connecting virtual machines, as they generate large
531TSO segments that are not split unless they reach a physical device.
532.Pp
533NOTE: The length field always refers to the individual
534fragment; there is no place with the total length of a packet.
535.Pp
536On receive rings the macro
537.Va NS_RFRAGS(slot)
538indicates the remaining number of slots for this packet,
539including the current one.
540Slots with a value greater than 1 also have NS_MOREFRAG set.
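.Pp
Since no field holds the total length of a multi-slot packet, it has to be
computed by walking the chain.
The sketch below does so on a local model:
.Vt struct frag_slot ,
.Fn packet_total_len
and the
.Dv SKETCH_
constants are illustrative, and a real loop must also stop at
.Va tail .
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NS_MOREFRAG 0x0020	/* placeholder for NS_MOREFRAG */
#define SKETCH_MAX_FRAGS   64		/* maximum chain length stated above */

struct frag_slot {
	uint16_t len;
	uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot[first] in a
 * ring of num_slots entries.  Stores the number of slots used in *nfrags.
 * Returns 0 if the chain is longer than the maximum.
 */
static uint32_t
packet_total_len(const struct frag_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t *nfrags)
{
	uint32_t total = 0, i = first;

	assert(first < num_slots);
	for (uint32_t n = 1; n <= SKETCH_MAX_FRAGS; n++) {
		total += slot[i].len;
		if (!(slot[i].flags & SKETCH_NS_MOREFRAG)) {
			*nfrags = n;	/* last fragment: flag is clear */
			return total;
		}
		i = (i + 1 == num_slots) ? 0 : i + 1;
	}
	return 0;
}
```
.Ed
.Pp
A 5000-byte packet stored in 2048-byte buffers occupies three slots,
the first two with the flag set.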
541.Sh IOCTLS
542.Nm
543uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
544for non-blocking I/O.
545They take no argument.
546Two more ioctls (NIOCGINFO, NIOCREGIF) are used
547to query and configure ports, with the following argument:
548.Bd -literal
549struct nmreq {
550    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
551    uint32_t  nr_version;        /* (i) API version                */
552    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
553    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
554    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
555    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
556    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
557    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
558    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
559    uint16_t  nr_cmd;            /* (i) special command            */
560    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
561    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
562    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
    uint32_t  nr_flags;          /* (i/o) open mode                */
564    ...
565};
566.Ed
567.Pp
568A file descriptor obtained through
569.Pa /dev/netmap
also supports the ioctls supported by network devices, see
571.Xr netintro 4 .
572.Bl -tag -width XXXX
573.It Dv NIOCGINFO
574returns EINVAL if the named port does not support netmap.
575Otherwise, it returns 0 and (advisory) information
576about the port.
577Note that all the information below can change before the
578interface is actually put in netmap mode.
579.Bl -tag -width XX
580.It Pa nr_memsize
581indicates the size of the
582.Nm
583memory region.
584NICs in
585.Nm
586mode all share the same memory region,
587whereas
588.Nm VALE
589ports have independent regions for each port.
590.It Pa nr_tx_slots , nr_rx_slots
591indicate the size of transmit and receive rings.
592.It Pa nr_tx_rings , nr_rx_rings
593indicate the number of transmit
594and receive rings.
595Both ring number and sizes may be configured at runtime
596using interface-specific functions (e.g.,
597.Xr ethtool 8
598).
599.El
600.It Dv NIOCREGIF
601binds the port named in
602.Va nr_name
603to the file descriptor.
604For a physical device this also switches it into
605.Nm
606mode, disconnecting
607it from the host stack.
608Multiple file descriptors can be bound to the same port,
609with proper synchronization left to the user.
610.Pp
611The recommended way to bind a file descriptor to a port is
612to use function
613.Va nm_open(..)
614(see
615.Sx LIBRARIES )
616which parses names to access specific port types and
617enable features.
618In the following we document the main features.
619.Pp
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
621.Em netmap pipe ,
622consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
624and is meant to enable configuration where a master process acts
625as a dispatcher towards slave processes.
626.Pp
627To enable this function, the
628.Pa nr_arg1
629field of the structure can be used as a hint to the kernel to
630indicate how many pipes we expect to use, and reserve extra space
631in the memory region.
632.Pp
633On return, it gives the same info as NIOCGINFO,
634with
635.Pa nr_ringid
636and
637.Pa nr_flags
638indicating the identity of the rings controlled through the file
639descriptor.
640.Pp
.Va nr_flags
and
.Va nr_ringid
select which rings are controlled through this file descriptor.
644Possible values of
645.Pa nr_flags
646are indicated below, together with the naming schemes
647that application libraries (such as the
648.Nm nm_open
649indicated below) can use to indicate the specific set of rings.
650In the example below, "netmap:foo" is any valid netmap port name.
651.Bl -tag -width XXXXX
652.It NR_REG_ALL_NIC                         "netmap:foo"
653(default) all hardware ring pairs
654.It NR_REG_SW            "netmap:foo^"
655the ``host rings'', connecting to the host stack.
656.It NR_REG_NIC_SW        "netmap:foo*"
657all hardware rings and the host rings
658.It NR_REG_ONE_NIC       "netmap:foo-i"
659only the i-th hardware ring pair, where the number is in
660.Pa nr_ringid ;
661.It NR_REG_PIPE_MASTER  "netmap:foo{i"
662the master side of the netmap pipe whose identifier (i) is in
663.Pa nr_ringid ;
664.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
665the slave side of the netmap pipe whose identifier (i) is in
666.Pa nr_ringid .
667.Pp
The identifier of a pipe must be thought of as part of the pipe name,
669and does not need to be sequential.
670On return the pipe
671will only have a single ring pair with index 0,
672irrespective of the value of
673.Va i .
674.El
675.Pp
676By default, a
677.Xr poll 2
678or
679.Xr select 2
680call pushes out any pending packets on the transmit ring, even if
681no write events are specified.
682The feature can be disabled by or-ing
683.Va NETMAP_NO_TX_POLL
684to the value written to
685.Va nr_ringid .
When this feature is used,
packets are transmitted only on
.Va ioctl(NIOCTXSYNC) ,
or when
.Va select()
or
.Va poll()
is called with a write event (POLLOUT/wfdset) or a full ring.
693.Pp
694When registering a virtual interface that is dynamically created to a
695.Nm VALE
696switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings fields.
698.It Dv NIOCTXSYNC
699tells the hardware of new packets to transmit, and updates the
700number of slots available for transmission.
701.It Dv NIOCRXSYNC
702tells the hardware of consumed packets, and asks for newly available
703packets.
704.El
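.Pp
The naming schemes listed under
.Dv NIOCREGIF
map a port name suffix to a register mode and a ring or pipe identifier.
The following is a simplified sketch of that mapping, not the parser used by
.Fn nm_open :
the
.Vt enum reg_mode
values are local stand-ins for the
.Dv NR_REG_*
constants, and
.Fn parse_port_suffix
is a hypothetical helper.
.Bd -literal
```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for the NR_REG_* modes listed above. */
enum reg_mode {
	REG_ALL_NIC, REG_SW, REG_NIC_SW, REG_ONE_NIC,
	REG_PIPE_MASTER, REG_PIPE_SLAVE, REG_INVALID
};

/*
 * Classify "netmap:foo", "netmap:foo^", "netmap:foo*", "netmap:foo-i",
 * "netmap:foo{i" and "netmap:foo}i".  The ring or pipe identifier,
 * if any, is stored in *id.
 */
static enum reg_mode
parse_port_suffix(const char *name, unsigned *id)
{
	char *end;
	size_t n;

	assert(name != NULL && id != NULL);
	*id = 0;
	if (strncmp(name, "netmap:", 7) != 0)
		return REG_INVALID;
	name += 7;
	n = strcspn(name, "^*-{}");	/* length of the interface part */
	if (n == 0)
		return REG_INVALID;
	switch (name[n]) {
	case 0:		/* no suffix */
		return REG_ALL_NIC;
	case '^':
		return name[n + 1] == 0 ? REG_SW : REG_INVALID;
	case '*':
		return name[n + 1] == 0 ? REG_NIC_SW : REG_INVALID;
	}
	/* '-', '{' and '}' must be followed by a decimal number */
	if (!isdigit((unsigned char)name[n + 1]))
		return REG_INVALID;
	*id = (unsigned)strtoul(name + n + 1, &end, 10);
	if (*end != 0)
		return REG_INVALID;
	return name[n] == '-' ? REG_ONE_NIC :
	    name[n] == '{' ? REG_PIPE_MASTER : REG_PIPE_SLAVE;
}
```
.Ed
.Pp
In terms of the fields above, the returned mode would go into
.Pa nr_flags
and the identifier into
.Pa nr_ringid .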
705.Sh SELECT, POLL, EPOLL, KQUEUE
706.Xr select 2
707and
708.Xr poll 2
709on a
710.Nm
711file descriptor process rings as indicated in
712.Sx TRANSMIT RINGS
713and
714.Sx RECEIVE RINGS ,
715respectively when write (POLLOUT) and read (POLLIN) events are requested.
716Both block if no slots are available in the ring
717.Va ( ring->cur == ring->tail ) .
718Depending on the platform,
719.Xr epoll 7
720and
721.Xr kqueue 2
722are supported too.
723.Pp
724Packets in transmit rings are normally pushed out
725(and buffers reclaimed) even without
726requesting write events.
727Passing the
728.Dv NETMAP_NO_TX_POLL
729flag to
730.Em NIOCREGIF
731disables this feature.
732By default, receive rings are processed only if read
733events are requested.
734Passing the
735.Dv NETMAP_DO_RX_POLL
736flag to
.Em NIOCREGIF
updates receive rings even without read events.
738Note that on
739.Xr epoll 7
740and
741.Xr kqueue 2 ,
742.Dv NETMAP_NO_TX_POLL
743and
744.Dv NETMAP_DO_RX_POLL
745only have an effect when some event is posted for the file descriptor.
746.Sh LIBRARIES
747The
748.Nm
749API is supposed to be used directly, both because of its simplicity and
750for efficient integration with applications.
751.Pp
752For convenience, the
753.In net/netmap_user.h
754header provides a few macros and functions to ease creating
755a file descriptor and doing I/O with a
756.Nm
757port.
758These are loosely modeled after the
759.Xr pcap 3
760API, to ease porting of libpcap-based applications to
761.Nm .
762To use these extra functions, programs should
763.Dl #define NETMAP_WITH_LIBS
764before
765.Dl #include <net/netmap_user.h>
766.Pp
767The following functions are available:
768.Bl -tag -width XXXXX
769.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
770similar to
771.Xr pcap_open_live 3 ,
772binds a file descriptor to a port.
773.Bl -tag -width XX
774.It Va ifname
775is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
776.Nm VALE
777port.
778.It Va req
779provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
781ifname and flags, and other fields can be overridden through
782the other two arguments.
783.It Va arg
784points to a struct nm_desc containing arguments (e.g., from a previously
785open file descriptor) that should override the defaults.
The fields are used as described below.
787.It Va flags
788can be set to a combination of the following flags:
789.Va NETMAP_NO_TX_POLL ,
790.Va NETMAP_DO_RX_POLL
791(copied into nr_ringid);
792.Va NM_OPEN_NO_MMAP
793(if arg points to the same memory region,
794avoids the mmap and uses the values from it);
795.Va NM_OPEN_IFNAME
796(ignores ifname and uses the values in arg);
797.Va NM_OPEN_ARG1 ,
798.Va NM_OPEN_ARG2 ,
799.Va NM_OPEN_ARG3
800(uses the fields from arg);
801.Va NM_OPEN_RING_CFG
802(uses the ring number and sizes from arg).
803.El
804.It Va int nm_close(struct nm_desc *d )
805closes the file descriptor, unmaps memory, frees resources.
806.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
807similar to
808.Va pcap_inject() ,
809pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
811.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
812similar to
813.Va pcap_dispatch() ,
applies a callback to incoming packets;
815.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
816similar to
817.Va pcap_next() ,
fetches the next packet.
819.El
820.Sh SUPPORTED DEVICES
821.Nm
822natively supports the following devices:
823.Pp
824On
825.Fx :
826.Xr cxgbe 4 ,
827.Xr em 4 ,
828.Xr iflib 4
829.Pq providing Xr igb 4 and Xr em 4 ,
830.Xr ixgbe 4 ,
831.Xr ixl 4 ,
832.Xr re 4 ,
833.Xr vtnet 4 .
834.Pp
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
836.Pp
837NICs without native support can still be used in
838.Nm
839mode through emulation.
840Performance is inferior to native netmap
841mode but still significantly higher than various raw socket types
842(bpf, PF_PACKET, etc.).
843Note that for slow devices (such as 1 Gbit/s and slower NICs,
844or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native mode will likely have similar or identical throughput.
846.Pp
847When emulation is in use, packet sniffer programs such as tcpdump
848could see received packets before they are diverted by netmap.
849This behaviour is not intentional, being just an artifact of the implementation
850of emulation.
851Note that in case the netmap application subsequently moves packets received
852from the emulated adapter onto the host RX ring, the sniffer will intercept
853those packets again, since the packets are injected to the host stack as they
854were received by the network interface.
855.Pp
856Emulation is also available for devices with native netmap support,
857which can be used for testing or performance comparison.
858The sysctl variable
859.Va dev.netmap.admode
860globally controls how netmap mode is implemented.
861.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
862Some aspects of the operation of
863.Nm
864and
865.Nm VALE
866are controlled through sysctl variables on
867.Fx
868.Em ( dev.netmap.* )
869and module parameters on Linux
870.Em ( /sys/module/netmap/parameters/* ) :
871.Bl -tag -width indent
872.It Va dev.netmap.admode: 0
873Controls the use of native or emulated adapter mode.
874.Pp
8750 uses the best available option;
876.Pp
8771 forces native mode and fails if not available;
878.Pp
2 forces emulated mode, hence never fails.
880.It Va dev.netmap.generic_rings: 1
881Number of rings used for emulated netmap mode
882.It Va dev.netmap.generic_ringsize: 1024
883Ring size used for emulated netmap mode
884.It Va dev.netmap.generic_mit: 100000
885Controls interrupt moderation for emulated mode
886.It Va dev.netmap.fwd: 0
887Forces NS_FORWARD mode
888.It Va dev.netmap.txsync_retry: 2
889Number of txsync loops in the
890.Nm VALE
891flush function
892.It Va dev.netmap.no_pendintr: 1
893Forces recovery of transmit buffers on system calls
894.It Va dev.netmap.no_timestamp: 0
895Disables the update of the timestamp in the netmap ring
896.It Va dev.netmap.verbose: 0
897Verbose kernel messages
898.It Va dev.netmap.buf_num: 163840
899.It Va dev.netmap.buf_size: 2048
900.It Va dev.netmap.ring_num: 200
901.It Va dev.netmap.ring_size: 36864
902.It Va dev.netmap.if_num: 100
903.It Va dev.netmap.if_size: 1024
904Sizes and number of objects (netmap_if, netmap_ring, buffers)
905for the global memory region.
906The only parameter worth modifying is
907.Va dev.netmap.buf_num
908as it impacts the total amount of memory used by netmap.
909.It Va dev.netmap.buf_curr_num: 0
910.It Va dev.netmap.buf_curr_size: 0
911.It Va dev.netmap.ring_curr_num: 0
912.It Va dev.netmap.ring_curr_size: 0
913.It Va dev.netmap.if_curr_num: 0
914.It Va dev.netmap.if_curr_size: 0
915Actual values in use.
916.It Va dev.netmap.priv_buf_num: 4098
917.It Va dev.netmap.priv_buf_size: 2048
918.It Va dev.netmap.priv_ring_num: 4
919.It Va dev.netmap.priv_ring_size: 20480
920.It Va dev.netmap.priv_if_num: 2
921.It Va dev.netmap.priv_if_size: 1024
922Sizes and number of objects (netmap_if, netmap_ring, buffers)
923for private memory regions.
924A separate memory region is used for each
925.Nm VALE
926port and each pair of
927.Nm netmap pipes .
928.It Va dev.netmap.bridge_batch: 1024
929Batch size used when moving packets across a
930.Nm VALE
931switch.
932Values above 64 generally guarantee good
933performance.
934.It Va dev.netmap.max_bridges: 8
935Max number of
936.Nm VALE
switches that can be created.
This tunable can be specified at loader time.
939.It Va dev.netmap.ptnet_vnet_hdr: 1
940Allow ptnet devices to use virtio-net headers
941.El
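.Pp
As a worked example of how these values translate into memory use, the
sketch below sums the number of objects times their size over a set of pools.
The
.Vt struct pool_cfg
type and
.Fn region_bytes
are illustrative, and the result ignores any rounding or alignment that
the allocator may apply.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One object pool: number of objects and size of each, as in the sysctls. */
struct pool_cfg {
	uint32_t num;
	uint32_t size;
};

/* Sum of num * size over the pools, ignoring allocator rounding. */
static uint64_t
region_bytes(const struct pool_cfg *pools, size_t npools)
{
	uint64_t total = 0;

	assert(pools != NULL);
	for (size_t i = 0; i < npools; i++)
		total += (uint64_t)pools[i].num * pools[i].size;
	return total;
}
```
.Ed
.Pp
With the default values listed above, the buffers of the global region
alone account for 163840 * 2048 = 335544320 bytes (320 MiB), which is why
.Va dev.netmap.buf_num
dominates the memory footprint.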
942.Sh SYSTEM CALLS
943.Nm
944uses
945.Xr select 2 ,
946.Xr poll 2 ,
947.Xr epoll 7
948and
949.Xr kqueue 2
950to wake up processes when significant events occur, and
951.Xr mmap 2
952to map memory.
953.Xr ioctl 2
954is used to configure ports and
955.Nm VALE switches .
956.Pp
957Applications may need to create threads and bind them to
958specific cores to improve performance, using standard
959OS primitives, see
960.Xr pthread 3 .
961In particular,
962.Xr pthread_setaffinity_np 3
963may be of use.
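.Pp
A minimal sketch of binding a thread to a core is shown below.
It assumes that pinning the calling thread to the first CPU it is
already allowed to run on is acceptable; a real application would
normally choose the core explicitly.
The function name
.Fn pin_self_to_first_cpu
is hypothetical, and the conditional includes reflect the different
headers and mask types used on
.Fx
and Linux.

```c
#define _GNU_SOURCE		/* Linux: exposes CPU_SET() and the *_np calls */
#include <assert.h>
#include <pthread.h>
#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#include <pthread_np.h>
typedef cpuset_t cpu_mask_t;
#else
#include <sched.h>
typedef cpu_set_t cpu_mask_t;
#endif

/*
 * Pin the calling thread to the first CPU in its current affinity mask.
 * Returns the CPU number on success, -1 on failure.
 */
int
pin_self_to_first_cpu(void)
{
	cpu_mask_t allowed, one;
	pthread_t self = pthread_self();
	int cpu;

	if (pthread_getaffinity_np(self, sizeof(allowed), &allowed) != 0)
		return (-1);
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		if (pthread_setaffinity_np(self, sizeof(one), &one) != 0)
			return (-1);
		/* The new mask must contain exactly the CPU we asked for. */
		assert(pthread_getaffinity_np(self, sizeof(one), &one) == 0 &&
		    CPU_ISSET(cpu, &one));
		return (cpu);
	}
	return (-1);
}
```

A worker thread would typically call such a function once, before
entering its poll loop.
Depending on the C library, the program may need to be built with
.Fl pthread .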
964.Sh EXAMPLES
965.Ss TEST PROGRAMS
966.Nm
967comes with a few programs that can be used for testing or
968simple applications.
969See the
970.Pa examples/
971directory in
972.Nm
973distributions, or
974.Pa tools/tools/netmap/
975directory in
976.Fx
977distributions.
978.Pp
979.Xr pkt-gen 8
980is a general purpose traffic source/sink.
981.Pp
982As an example
983.Dl pkt-gen -i ix0 -f tx -l 60
984can generate an infinite stream of minimum size packets, and
985.Dl pkt-gen -i ix0 -f rx
986is a traffic sink.
987Both print traffic statistics, to help monitor
988how the system performs.
989.Pp
990.Xr pkt-gen 8
has many options that can be used to set packet sizes, addresses,
rates, and to use multiple send/receive threads and cores.
993.Pp
.Xr bridge 8
995is another test program which interconnects two
996.Nm
997ports.
998It can be used for transparent forwarding between
999interfaces, as in
1000.Dl bridge -i netmap:ix0 -i netmap:ix1
or even to connect the NIC to the host stack using netmap, as in
1002.Dl bridge -i netmap:ix0
1003.Ss USING THE NATIVE API
1004The following code implements a traffic generator:
1005.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <net/netmap_user.h>
\&...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;            /* base of the mapped memory region */
    char *buf;          /* buffer of the current slot */
    uint32_t i;         /* index of the current slot */
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);     /* put ix0 in netmap mode */
    p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);          /* wait for free TX slots */
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
.Ed
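.Pp
The index handling in the inner loop of the sender can be modelled
without a
.Nm
device.
The sketch below uses a hypothetical
.Va sketch_ring
structure that carries only the index fields of a ring, and stand-ins
for
.Fn nm_ring_next
and
.Fn nm_ring_empty
written under the assumption that indexes wrap at
.Va num_slots
and that a ring has no available slots when
.Va cur
equals
.Va tail .

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct sketch_ring {
	uint32_t num_slots;
	uint32_t head, cur, tail;
};

/* Index following i, wrapping around at the end of the ring. */
static uint32_t
sketch_ring_next(const struct sketch_ring *r, uint32_t i)
{
	return ((i + 1 == r->num_slots) ? 0 : i + 1);
}

/* True when no slot is available to the application. */
static int
sketch_ring_empty(const struct sketch_ring *r)
{
	return (r->cur == r->tail);
}

/* Use every available slot; returns the number of slots consumed. */
uint32_t
sketch_fill(struct sketch_ring *r)
{
	uint32_t n = 0;

	assert(r->num_slots > 0);
	assert(r->cur < r->num_slots && r->tail < r->num_slots);
	while (!sketch_ring_empty(r)) {
		/* A real sender would fill slot[r->cur] here. */
		r->head = r->cur = sketch_ring_next(r, r->cur);
		n++;
	}
	return (n);
}
```

Starting from
.Va cur
= 6 and
.Va tail
= 2 on an 8-slot ring, the loop visits slots 6, 7, 0 and 1, then stops.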
1038.Ss HELPER FUNCTIONS
1039A simple receiver can be implemented using the helper functions:
1040.Pp
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <poll.h>
#include <net/netmap_user.h>
\&...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    if (d == NULL)
        return;                     /* could not open the port */
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);          /* wait for incoming packets */
        while ((buf = nm_nextpkt(d, &h)) != NULL)
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
.Ed
1063.Ss ZERO-COPY FORWARDING
1064Since physical interfaces share the same memory region,
it is possible to forward packets between ports by
swapping buffers instead of copying them.
1067The buffer from the transmit ring is used
1068to replenish the receive ring:
1069.Pp
.Bd -literal -compact
    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];     /* next received packet */
    dst = &txr->slot[txr->cur];     /* next free transmit slot */
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
.Ed
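.Pp
The buffer swap can be exercised on its own with mock data structures.
In the sketch below,
.Va fwd_slot ,
.Va fwd_ring ,
.Dv FWD_BUF_CHANGED
and
.Fn fwd_swap
are hypothetical stand-ins, not part of the
.Nm
API, and the loop bound (the smaller of the two ring spaces) is an
assumption about how a forwarder would avoid overrunning either ring.

```c
#include <assert.h>
#include <stdint.h>

#define FWD_BUF_CHANGED	0x0001	/* placeholder for NS_BUF_CHANGED */
#define FWD_SLOTS	4	/* slots per mock ring */

struct fwd_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

struct fwd_ring {
	uint32_t head, cur, tail;
	struct fwd_slot slot[FWD_SLOTS];
};

static uint32_t
fwd_next(uint32_t i)
{
	return ((i + 1 == FWD_SLOTS) ? 0 : i + 1);
}

/* Number of slots between cur and tail, accounting for wrap-around. */
static uint32_t
fwd_space(const struct fwd_ring *r)
{
	return ((r->tail + FWD_SLOTS - r->cur) % FWD_SLOTS);
}

/* Move packets from rxr to txr by swapping buffer indexes. */
uint32_t
fwd_swap(struct fwd_ring *rxr, struct fwd_ring *txr)
{
	uint32_t n, limit = fwd_space(rxr);

	assert(rxr != txr);
	if (fwd_space(txr) < limit)
		limit = fwd_space(txr);
	for (n = 0; n < limit; n++) {
		struct fwd_slot *src = &rxr->slot[rxr->cur];
		struct fwd_slot *dst = &txr->slot[txr->cur];
		uint32_t tmp = dst->buf_idx;

		dst->buf_idx = src->buf_idx;	/* packet goes to TX */
		dst->len = src->len;
		dst->flags = FWD_BUF_CHANGED;
		src->buf_idx = tmp;		/* free buffer goes to RX */
		src->flags = FWD_BUF_CHANGED;
		rxr->head = rxr->cur = fwd_next(rxr->cur);
		txr->head = txr->cur = fwd_next(txr->cur);
	}
	return (n);
}
```

No packet data is touched: only the buffer indexes, lengths and flags
in the slots change.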
1086.Ss ACCESSING THE HOST STACK
1087The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g., with
.Dl nm_open("netmap:eth0^", ... );
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
1094.Ss VALE SWITCH
1095A simple way to test the performance of a
1096.Nm VALE
1097switch is to attach a sender and a receiver to it,
1098e.g., running the following in two different terminals:
1099.Dl pkt-gen -i vale1:a -f rx # receiver
1100.Dl pkt-gen -i vale1:b -f tx # sender
1101The same example can be used to test netmap pipes, by simply
1102changing port names, e.g.,
1103.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
1104.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
1105.Pp
1106The following command attaches an interface and the host stack
1107to a switch:
1108.Dl valectl -h vale2:em0
1109Other
1110.Nm
1111clients attached to the same switch can now communicate
1112with the network card or the host.
1113.Sh SEE ALSO
1114.Xr vale 4 ,
1115.Xr bridge 8 ,
1116.Xr valectl 8 ,
1117.Xr lb 8 ,
1118.Xr nmreplay 8 ,
1119.Xr pkt-gen 8
1120.Pp
1121.Pa http://info.iet.unipi.it/~luigi/netmap/
1122.Pp
1123Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
1124Communications of the ACM, 55 (3), pp.45-51, March 2012
1125.Pp
1126Luigi Rizzo, netmap: a novel framework for fast packet I/O,
1127Usenix ATC'12, June 2012, Boston
1128.Pp
1129Luigi Rizzo, Giuseppe Lettieri,
1130VALE, a switched ethernet for virtual machines,
1131ACM CoNEXT'12, December 2012, Nice
1132.Pp
1133Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
1134Speeding up packet I/O in virtual machines,
1135ACM/IEEE ANCS'13, October 2013, San Jose
1136.Sh AUTHORS
1137.An -nosplit
1138The
1139.Nm
framework was originally designed and implemented at the
1141Universita` di Pisa in 2011 by
1142.An Luigi Rizzo ,
1143and further extended with help from
1144.An Matteo Landi ,
1145.An Gaetano Catalli ,
1146.An Giuseppe Lettieri ,
1147and
1148.An Vincenzo Maffione .
1149.Pp
1150.Nm
1151and
1152.Nm VALE
1153have been funded by the European Commission within FP7 Projects
1154CHANGE (257422) and OPENLAB (287581).
1155.Sh CAVEATS
1156No matter how fast the CPU and OS are,
1157achieving line rate on 10G and faster interfaces
1158requires hardware with sufficient performance.
1159Several NICs are unable to sustain line rate with
1160small packet sizes.
1161Insufficient PCIe or memory bandwidth
1162can also cause reduced performance.
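.Pp
To give a sense of scale, the packet rate that corresponds to line rate
can be computed from the link speed and the frame size.
The sketch below assumes standard Ethernet framing, where each frame
occupies an additional 20 bytes on the wire (preamble, start-of-frame
delimiter and inter-frame gap); the function name
.Fn line_rate_pps
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-frame overhead on the wire: 7-byte preamble, 1-byte start-of-frame
 * delimiter and 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD	20

/*
 * Maximum frames per second on an Ethernet link of link_bps bits per
 * second, for frames of frame_len bytes (including the 4-byte CRC).
 */
uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_len)
{
	assert(frame_len >= 64);	/* minimum Ethernet frame size */
	return (link_bps / (8ULL * (frame_len + ETH_WIRE_OVERHEAD)));
}
```

For a 10 Gbit/s link this gives about 14.88 million packets per second
with 64-byte frames, against roughly 0.81 million with 1518-byte frames,
which is why small packets are the demanding case.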
1163.Pp
1164Another frequent reason for low performance is the use
1165of flow control on the link: a slow receiver can limit
1166the transmit speed.
1167Be sure to disable flow control when running high
1168speed experiments.
1169.Ss SPECIAL NIC FEATURES
1170.Nm
1171is orthogonal to some NIC features such as
multiqueue, schedulers, and packet filters.
1173.Pp
1174Multiple transmit and receive rings are supported natively
1175and can be configured with ordinary OS tools,
1176such as
1177.Xr ethtool 8
1178or
1179device-specific sysctl variables.
1180The same goes for Receive Packet Steering (RPS)
1181and filtering of incoming traffic.
1182.Pp
1183.Nm
1184.Em does not use
1185features such as
1186.Em checksum offloading , TCP segmentation offloading ,
1187.Em encryption , VLAN encapsulation/decapsulation ,
1188etc.
1189When using netmap to exchange packets with the host stack,
1190make sure to disable these features.
1191