xref: /freebsd/share/man/man4/bpf.4 (revision a0409676120c1e558d0ade943019934e0f15118d)
1.\" Copyright (c) 2007 Seccuris Inc.
2.\" All rights reserved.
3.\"
4.\" This software was developed by Robert N. M. Watson under contract to
5.\" Seccuris Inc.
6.\"
7.\" Redistribution and use in source and binary forms, with or without
8.\" modification, are permitted provided that the following conditions
9.\" are met:
10.\" 1. Redistributions of source code must retain the above copyright
11.\"    notice, this list of conditions and the following disclaimer.
12.\" 2. Redistributions in binary form must reproduce the above copyright
13.\"    notice, this list of conditions and the following disclaimer in the
14.\"    documentation and/or other materials provided with the distribution.
15.\"
16.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
17.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
19.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
20.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
21.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
22.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
23.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
24.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
25.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
26.\" SUCH DAMAGE.
27.\"
28.\" Copyright (c) 1990 The Regents of the University of California.
29.\" All rights reserved.
30.\"
31.\" Redistribution and use in source and binary forms, with or without
32.\" modification, are permitted provided that: (1) source code distributions
33.\" retain the above copyright notice and this paragraph in its entirety, (2)
34.\" distributions including binary code include the above copyright notice and
35.\" this paragraph in its entirety in the documentation or other materials
36.\" provided with the distribution, and (3) all advertising materials mentioning
37.\" features or use of this software display the following acknowledgement:
38.\" ``This product includes software developed by the University of California,
39.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
40.\" the University nor the names of its contributors may be used to endorse
41.\" or promote products derived from this software without specific prior
42.\" written permission.
43.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
44.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
45.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
46.\"
47.\" This document is derived in part from the enet man page (enet.4)
48.\" distributed with 4.3BSD Unix.
49.\"
50.\" $FreeBSD$
51.\"
52.Dd October 9, 2020
53.Dt BPF 4
54.Os
55.Sh NAME
56.Nm bpf
57.Nd Berkeley Packet Filter
58.Sh SYNOPSIS
59.Cd device bpf
60.Sh DESCRIPTION
61The Berkeley Packet Filter
62provides a raw interface to data link layers in a protocol
63independent fashion.
64All packets on the network, even those destined for other hosts,
65are accessible through this mechanism.
66.Pp
67The packet filter appears as a character special device,
68.Pa /dev/bpf .
69After opening the device, the file descriptor must be bound to a
70specific network interface with the
71.Dv BIOCSETIF
72ioctl.
73A given interface can be shared by multiple listeners, and the filter
74underlying each descriptor will see an identical packet stream.
75.Pp
76Associated with each open instance of a
77.Nm
78file is a user-settable packet filter.
79Whenever a packet is received by an interface,
80all file descriptors listening on that interface apply their filter.
81Each descriptor that accepts the packet receives its own copy.
82.Pp
83A packet can be sent out on the network by writing to a
84.Nm
85file descriptor.
86The writes are unbuffered, meaning only one packet can be processed per write.
87Currently, only writes to Ethernets and
88.Tn SLIP
89links are supported.
90.Sh BUFFER MODES
91.Nm
92devices deliver packet data to the application via memory buffers provided by
93the application.
94The buffer mode is set using the
95.Dv BIOCSETBUFMODE
96ioctl, and read using the
97.Dv BIOCGETBUFMODE
98ioctl.
99.Ss Buffered read mode
100By default,
101.Nm
102devices operate in the
103.Dv BPF_BUFMODE_BUFFER
104mode, in which packet data is copied explicitly from kernel to user memory
105using the
106.Xr read 2
107system call.
108The user process will declare a fixed buffer size that will be used both for
109sizing internal buffers and for all
110.Xr read 2
111operations on the file.
112This size is queried using the
113.Dv BIOCGBLEN
114ioctl, and is set using the
115.Dv BIOCSBLEN
116ioctl.
117Note that an individual packet larger than the buffer size is necessarily
118truncated.
119.Ss Zero-copy buffer mode
120.Nm
121devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
123mode, in which packet data is written directly into two user memory buffers
124by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an integer multiple
of the page size.
127The maximum zero-copy buffer size is returned by the
128.Dv BIOCGETZMAX
129ioctl.
130Note that an individual packet larger than the buffer size is necessarily
131truncated.
132.Pp
133The user process registers two memory buffers using the
134.Dv BIOCSETZBUF
135ioctl, which accepts a
136.Vt struct bpf_zbuf
137pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
145.Pp
146.Vt bz_bufa
147is a pointer to the userspace address of the first buffer that will be
148filled, and
149.Vt bz_bufb
150is a pointer to the second buffer.
151.Nm
152will then cycle between the two buffers as they fill and are acknowledged.
153.Pp
154Each buffer begins with a fixed-length header to hold synchronization and
155data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int  bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int  bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int  bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
164.Pp
165The header structure of each buffer, including all padding, should be zeroed
166before it is configured using
167.Dv BIOCSETZBUF .
168Remaining space in the buffer will be used by the kernel to store packet
169data, laid out in the same format as with buffered read mode.
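.Pp
The size and alignment rules above can be sketched as follows.
This is a minimal illustration rather than the interface itself: the
.Vt struct zbuf_sketch
type and the
.Fn zbuf_prepare
helper are hypothetical names, the structure merely mirrors the fields of
.Vt struct bpf_zbuf ,
and the maximum size is passed in as a plain argument where a real program
would obtain it from
.Dv BIOCGETZMAX .
Anonymous mappings are used because they are page-aligned and zero-filled,
which also leaves the buffer headers zeroed.

```c
/* Expose MAP_ANON under strict glibc modes; harmless elsewhere. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Local mirror of struct bpf_zbuf as shown above (illustrative only). */
struct zbuf_sketch {
	void	*bz_bufa;
	void	*bz_bufb;
	size_t	 bz_buflen;
};

/*
 * Cap want at zmax, round it down to a whole number of pages, and map two
 * zero-filled, page-aligned buffers of that size.
 * Returns 0 on success and -1 on failure.
 */
static int
zbuf_prepare(struct zbuf_sketch *bz, size_t want, size_t zmax)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = want < zmax ? want : zmax;

	assert(bz != NULL);
	len -= len % pagesz;
	if (len == 0)
		return (-1);
	bz->bz_bufa = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (bz->bz_bufa == MAP_FAILED)
		return (-1);
	bz->bz_bufb = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (bz->bz_bufb == MAP_FAILED) {
		(void)munmap(bz->bz_bufa, len);
		return (-1);
	}
	bz->bz_buflen = len;
	return (0);
}
```

A caller would fill a real
.Vt struct bpf_zbuf
from the three resulting values before issuing
.Dv BIOCSETZBUF .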
170.Pp
171The kernel and the user process follow a simple acknowledgement protocol via
172the buffer header to synchronize access to the buffer: when the header
173generation numbers,
174.Vt bzh_kernel_gen
175and
176.Vt bzh_user_gen ,
177hold the same value, the kernel owns the buffer, and when they differ,
178userspace owns the buffer.
179.Pp
180While the kernel owns the buffer, the contents are unstable and may change
181asynchronously; while the user process owns the buffer, its contents are
182stable and will not be changed until the buffer has been acknowledged.
183.Pp
184Initializing the buffer headers to all 0's before registering the buffer has
185the effect of assigning initial ownership of both buffers to the kernel.
186The kernel signals that a buffer has been assigned to userspace by modifying
187.Vt bzh_kernel_gen ,
188and userspace acknowledges the buffer and returns it to the kernel by setting
189the value of
190.Vt bzh_user_gen
191to the value of
192.Vt bzh_kernel_gen .
193.Pp
194In order to avoid caching and memory re-ordering effects, the user process
195must use atomic operations and memory barriers when checking for and
196acknowledging buffers:
.Bd -literal
#include <sys/types.h>
#include <machine/atomic.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
222.Pp
223The user process may force the assignment of the next buffer, if any data
224is pending, to userspace using the
225.Dv BIOCROTZBUF
226ioctl.
227This allows the user process to retrieve data in a partially filled buffer
228before the buffer is full, such as following a timeout; the process must
229recheck for buffer ownership using the header generation numbers, as the
230buffer will not be assigned to userspace if no data was present.
231.Pp
232As in the buffered read mode,
233.Xr kqueue 2 ,
234.Xr poll 2 ,
235and
236.Xr select 2
237may be used to sleep awaiting the availability of a completed buffer.
238They will return a readable file descriptor when ownership of the next buffer
239is assigned to user space.
240.Pp
241In the current implementation, the kernel may assign zero, one, or both
242buffers to the user process; however, an earlier implementation maintained
243the invariant that at most one buffer could be assigned to the user process
244at a time.
245In order to both ensure progress and high performance, user processes should
246acknowledge a completely processed buffer as quickly as possible, returning
247it for reuse, and not block waiting on a second buffer while holding another
248buffer.
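.Pp
The acknowledgement protocol described in this section can be exercised
without a kernel by using C11 atomics.
The sketch below is an assumption-laden analog, not the kernel's code:
.Vt struct zhdr_sketch
and all three functions are hypothetical, and
.Fn sketch_kernel_assign
stands in for the kernel by incrementing the generation number, whereas the
text above only states that the kernel modifies it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative header with the three fields described above (no padding). */
struct zhdr_sketch {
	atomic_uint	kernel_gen;
	atomic_uint	kernel_len;
	atomic_uint	user_gen;
};

/* Userspace owns the buffer when the generation numbers differ. */
static int
sketch_check(struct zhdr_sketch *h)
{
	return (atomic_load_explicit(&h->user_gen, memory_order_relaxed) !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Return the buffer to the kernel by copying kernel_gen into user_gen. */
static void
sketch_acknowledge(struct zhdr_sketch *h)
{
	unsigned int gen;

	assert(sketch_check(h));	/* Only acknowledge a buffer we own. */
	gen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, gen, memory_order_release);
}

/* Stand-in for the kernel handing over a buffer holding len bytes. */
static void
sketch_kernel_assign(struct zhdr_sketch *h, unsigned int len)
{
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}
```

Starting from an all-zero header, the kernel owns the buffer; after
.Fn sketch_kernel_assign
userspace owns it, and
.Fn sketch_acknowledge
hands it back.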
249.Sh IOCTLS
250The
251.Xr ioctl 2
252command codes below are defined in
253.In net/bpf.h .
254All commands require
255these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
262.Pp
263Additionally,
264.Dv BIOCGETIF
265and
266.Dv BIOCSETIF
267require
268.In sys/socket.h
269and
270.In net/if.h .
271.Pp
272In addition to
273.Dv FIONREAD
274the following commands may be applied to any open
275.Nm
276file.
277The (third) argument to
278.Xr ioctl 2
279should be a pointer to the type indicated.
280.Bl -tag -width BIOCGETBUFMODE
281.It Dv BIOCGBLEN
282.Pq Li u_int
283Returns the required buffer length for reads on
284.Nm
285files.
286.It Dv BIOCSBLEN
287.Pq Li u_int
288Sets the buffer length for reads on
289.Nm
290files.
291The buffer must be set before the file is attached to an interface
292with
293.Dv BIOCSETIF .
294If the requested buffer size cannot be accommodated, the closest
295allowable size will be set and returned in the argument.
296A read call will result in
297.Er EINVAL
298if it is passed a buffer that is not this size.
299.It Dv BIOCGDLT
300.Pq Li u_int
301Returns the type of the data link layer underlying the attached interface.
302.Er EINVAL
303is returned if no interface has been specified.
304The device types, prefixed with
305.Dq Li DLT_ ,
306are defined in
307.In net/bpf.h .
308.It Dv BIOCGDLTLIST
309.Pq Li "struct bpf_dltlist"
310Returns an array of the available types of the data link layer
311underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
318.Pp
319The available types are returned in the array pointed to by the
320.Va bfl_list
321field while their length in u_int is supplied to the
322.Va bfl_len
323field.
324.Er ENOMEM
325is returned if there is not enough buffer space and
326.Er EFAULT
327is returned if a bad address is encountered.
328The
329.Va bfl_len
330field is modified on return to indicate the actual length in u_int
331of the array returned.
332If
333.Va bfl_list
334is
335.Dv NULL ,
336the
337.Va bfl_len
338field is set to indicate the required length of an array in u_int.
339.It Dv BIOCSDLT
340.Pq Li u_int
341Changes the type of the data link layer underlying the attached interface.
342.Er EINVAL
343is returned if no interface has been specified or the specified
344type is not available for the interface.
345.It Dv BIOCPROMISC
346Forces the interface into promiscuous mode.
347All packets, not just those destined for the local host, are processed.
348Since more than one file can be listening on a given interface,
349a listener that opened its interface non-promiscuously may receive
350packets promiscuously.
351This problem can be remedied with an appropriate filter.
352.Pp
353The interface remains in promiscuous mode until all files listening
354promiscuously are closed.
355.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by
.Dv BIOCGSTATS .
358.It Dv BIOCGETIF
359.Pq Li "struct ifreq"
360Returns the name of the hardware interface that the file is listening on.
The name is returned in the
.Li ifr_name
field of the
.Li ifreq
structure.
365All other fields are undefined.
366.It Dv BIOCSETIF
367.Pq Li "struct ifreq"
368Sets the hardware interface associated with the file.
369This
370command must be performed before any packets can be read.
371The device is indicated by name using the
372.Li ifr_name
373field of the
374.Li ifreq
375structure.
376Additionally, performs the actions of
377.Dv BIOCFLUSH .
378.It Dv BIOCSRTIMEOUT
379.It Dv BIOCGRTIMEOUT
380.Pq Li "struct timeval"
381Sets or gets the read timeout parameter.
382The argument
383specifies the length of time to wait before timing
384out on a read request.
385This parameter is initialized to zero by
386.Xr open 2 ,
387indicating no timeout.
388.It Dv BIOCGSTATS
389.Pq Li "struct bpf_stat"
390Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
397.Pp
398The fields are:
399.Bl -hang -offset indent
400.It Li bs_recv
401the number of packets received by the descriptor since opened or reset
402(including any buffered since the last read call);
403and
404.It Li bs_drop
405the number of packets which were accepted by the filter but dropped by the
406kernel because of buffer overflows
407(i.e., the application's reads are not keeping up with the packet traffic).
408.El
409.It Dv BIOCIMMEDIATE
410.Pq Li u_int
411Enables or disables
412.Dq immediate mode ,
413based on the truth value of the argument.
414When immediate mode is enabled, reads return immediately upon packet
415reception.
416Otherwise, a read will block until either the kernel buffer
417becomes full or a timeout occurs.
418This is useful for programs like
419.Xr rarpd 8
420which must respond to messages in real time.
421The default for a new file is off.
422.It Dv BIOCSETF
423.It Dv BIOCSETFNR
424.Pq Li "struct bpf_program"
425Sets the read filter program used by the kernel to discard uninteresting
426packets.
427An array of instructions and its length is passed in using
428the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
435.Pp
436The filter program is pointed to by the
437.Li bf_insns
438field while its length in units of
439.Sq Li struct bpf_insn
440is given by the
441.Li bf_len
442field.
443See section
444.Sx "FILTER MACHINE"
445for an explanation of the filter language.
446The only difference between
447.Dv BIOCSETF
448and
449.Dv BIOCSETFNR
450is
451.Dv BIOCSETF
452performs the actions of
453.Dv BIOCFLUSH
454while
455.Dv BIOCSETFNR
456does not.
457.It Dv BIOCSETWF
458.Pq Li "struct bpf_program"
459Sets the write filter program used by the kernel to control what type of
460packets can be written to the interface.
461See the
462.Dv BIOCSETF
463command for more
464information on the
465.Nm
466filter program.
467.It Dv BIOCVERSION
468.Pq Li "struct bpf_version"
469Returns the major and minor version numbers of the filter language currently
470recognized by the kernel.
471Before installing a filter, applications must check
472that the current version is compatible with the running kernel.
473Version numbers are compatible if the major numbers match and the application minor
474is less than or equal to the kernel minor.
475The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
        u_short bv_major;
        u_short bv_minor;
};
.Ed
482.Pp
483The current version numbers are given by
484.Dv BPF_MAJOR_VERSION
485and
486.Dv BPF_MINOR_VERSION
487from
488.In net/bpf.h .
489An incompatible filter
490may result in undefined behavior (most likely, an error returned by
491.Fn ioctl
492or haphazard packet matching).
493.It Dv BIOCGRSIG
494.It Dv BIOCSRSIG
495.Pq Li u_int
496Sets or gets the receive signal.
497This signal will be sent to the process or process group specified by
498.Dv FIOSETOWN .
499It defaults to
500.Dv SIGIO .
501.It Dv BIOCSHDRCMPLT
502.It Dv BIOCGHDRCMPLT
503.Pq Li u_int
504Sets or gets the status of the
505.Dq header complete
506flag.
507Set to zero if the link level source address should be filled in automatically
508by the interface output routine.
509Set to one if the link level source
510address will be written, as provided, to the wire.
511This flag is initialized to zero by default.
512.It Dv BIOCSSEESENT
513.It Dv BIOCGSEESENT
514.Pq Li u_int
515These commands are obsolete but left for compatibility.
516Use
517.Dv BIOCSDIRECTION
518and
519.Dv BIOCGDIRECTION
520instead.
521Sets or gets the flag determining whether locally generated packets on the
522interface should be returned by BPF.
523Set to zero to see only incoming packets on the interface.
524Set to one to see packets originating locally and remotely on the interface.
525This flag is initialized to one by default.
526.It Dv BIOCSDIRECTION
527.It Dv BIOCGDIRECTION
528.Pq Li u_int
529Sets or gets the setting determining whether incoming, outgoing, or all packets
530on the interface should be returned by BPF.
531Set to
532.Dv BPF_D_IN
533to see only incoming packets on the interface.
534Set to
535.Dv BPF_D_INOUT
536to see packets originating locally and remotely on the interface.
537Set to
538.Dv BPF_D_OUT
539to see only outgoing packets on the interface.
540This setting is initialized to
541.Dv BPF_D_INOUT
542by default.
543.It Dv BIOCSTSTAMP
544.It Dv BIOCGTSTAMP
545.Pq Li u_int
546Set or get format and resolution of the time stamps returned by BPF.
547Set to
548.Dv BPF_T_MICROTIME ,
549.Dv BPF_T_MICROTIME_FAST ,
550.Dv BPF_T_MICROTIME_MONOTONIC ,
551or
552.Dv BPF_T_MICROTIME_MONOTONIC_FAST
553to get time stamps in 64-bit
554.Vt struct timeval
555format.
556Set to
557.Dv BPF_T_NANOTIME ,
558.Dv BPF_T_NANOTIME_FAST ,
559.Dv BPF_T_NANOTIME_MONOTONIC ,
560or
561.Dv BPF_T_NANOTIME_MONOTONIC_FAST
562to get time stamps in 64-bit
563.Vt struct timespec
564format.
565Set to
566.Dv BPF_T_BINTIME ,
567.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
569or
570.Dv BPF_T_BINTIME_MONOTONIC_FAST
571to get time stamps in 64-bit
572.Vt struct bintime
573format.
574Set to
575.Dv BPF_T_NONE
576to ignore time stamp.
577All 64-bit time stamp formats are wrapped in
578.Vt struct bpf_ts .
579The
580.Dv BPF_T_MICROTIME_FAST ,
581.Dv BPF_T_NANOTIME_FAST ,
582.Dv BPF_T_BINTIME_FAST ,
583.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
584.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
585and
586.Dv BPF_T_BINTIME_MONOTONIC_FAST
587are analogs of corresponding formats without _FAST suffix but do not perform
588a full time counter query, so their accuracy is one timer tick.
589The
590.Dv BPF_T_MICROTIME_MONOTONIC ,
591.Dv BPF_T_NANOTIME_MONOTONIC ,
592.Dv BPF_T_BINTIME_MONOTONIC ,
593.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
594.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
595and
596.Dv BPF_T_BINTIME_MONOTONIC_FAST
597store the time elapsed since kernel boot.
598This setting is initialized to
599.Dv BPF_T_MICROTIME
600by default.
601.It Dv BIOCFEEDBACK
602.Pq Li u_int
603Set packet feedback mode.
604This allows injected packets to be fed back as input to the interface when
605output via the interface is successful.
606When
607.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF to avoid
609duplication.
610This flag is initialized to zero by default.
611.It Dv BIOCLOCK
612Set the locked flag on the
613.Nm
614descriptor.
615This prevents the execution of
616ioctl commands which could change the underlying operating parameters of
617the device.
618.It Dv BIOCGETBUFMODE
619.It Dv BIOCSETBUFMODE
620.Pq Li u_int
621Get or set the current
622.Nm
623buffering mode; possible values are
624.Dv BPF_BUFMODE_BUFFER ,
625buffered read mode, and
626.Dv BPF_BUFMODE_ZBUF ,
627zero-copy buffer mode.
628.It Dv BIOCSETZBUF
629.Pq Li struct bpf_zbuf
630Set the current zero-copy buffer locations; buffer locations may be
631set only once zero-copy buffer mode has been selected, and prior to attaching
632to an interface.
633Buffers must be of identical size, page-aligned, and an integer multiple of
634pages in size.
635The three fields
636.Vt bz_bufa ,
637.Vt bz_bufb ,
638and
639.Vt bz_buflen
640must be filled out.
641If buffers have already been set for this device, the ioctl will fail.
642.It Dv BIOCGETZMAX
643.Pq Li size_t
644Get the largest individual zero-copy buffer size allowed.
645As two buffers are used in zero-copy buffer mode, the limit (in practice) is
646twice the returned size.
647As zero-copy buffers consume kernel address space, conservative selection of
648buffer size is suggested, especially when there are multiple
649.Nm
650descriptors in use on 32-bit systems.
651.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
654If no data is present, the buffer will remain owned by the kernel.
655This allows consumers of zero-copy buffering to implement timeouts and
656retrieve partially filled buffers.
657In order to handle the case where no data is present in the buffer and
658therefore ownership is not assigned, the user process must check
659.Vt bzh_kernel_gen
660against
661.Vt bzh_user_gen .
662.El
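.Pp
The compatibility rule stated under
.Dv BIOCVERSION
(major numbers match, application minor no greater than kernel minor) can be
written as a small predicate.
This is only a sketch:
.Vt struct version_sketch
mirrors the fields of
.Vt struct bpf_version ,
and
.Fn version_compatible
is a hypothetical helper, not part of the interface.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of struct bpf_version as shown above (illustrative only). */
struct version_sketch {
	unsigned short	bv_major;
	unsigned short	bv_minor;
};

/*
 * Return non-zero if an application built against app_major.app_minor
 * may install filters on a kernel reporting the given version.
 */
static int
version_compatible(const struct version_sketch *kernel,
    unsigned short app_major, unsigned short app_minor)
{
	assert(kernel != NULL);
	return (kernel->bv_major == app_major &&
	    app_minor <= kernel->bv_minor);
}
```

In a real program the kernel values would come from
.Dv BIOCVERSION
and the application values from
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION .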
663.Sh STANDARD IOCTLS
664.Nm
665now supports several standard
666.Xr ioctl 2 Ns 's
667which allow the user to do async and/or non-blocking I/O to an open
.Nm
669file descriptor.
670.Bl -tag -width SIOCGIFADDR
671.It Dv FIONREAD
672.Pq Li int
673Returns the number of bytes that are immediately available for reading.
674.It Dv SIOCGIFADDR
675.Pq Li "struct ifreq"
676Returns the address associated with the interface.
677.It Dv FIONBIO
678.Pq Li int
679Sets or clears non-blocking I/O.
680If arg is non-zero, then doing a
681.Xr read 2
682when no data is available will return -1 and
683.Va errno
684will be set to
685.Er EAGAIN .
686If arg is zero, non-blocking I/O is disabled.
687Note: setting this overrides the timeout set by
688.Dv BIOCSRTIMEOUT .
689.It Dv FIOASYNC
690.Pq Li int
691Enables or disables async I/O.
692When enabled (arg is non-zero), the process or process group specified by
693.Dv FIOSETOWN
694will start receiving
695.Dv SIGIO 's
696when packets arrive.
697Note that you must do an
698.Dv FIOSETOWN
in order for this to take effect,
700as the system will not default this for you.
701The signal may be changed via
702.Dv BIOCSRSIG .
703.It Dv FIOSETOWN
704.It Dv FIOGETOWN
705.Pq Li int
706Sets or gets the process or process group (if negative) that should
707receive
708.Dv SIGIO
709when packets are available.
710The signal may be changed using
711.Dv BIOCSRSIG
712(see above).
713.El
714.Sh BPF HEADER
715One of the following structures is prepended to each packet returned by
716.Xr read 2
717or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};
.Ed
735.Pp
The fields, whose values are stored in host order, are:
737.Pp
738.Bl -tag -compact -width bh_datalen
739.It Li bh_tstamp
740The time at which the packet was processed by the packet filter.
741.It Li bh_caplen
742The length of the captured portion of the packet.
743This is the minimum of
744the truncation amount specified by the filter and the length of the packet.
745.It Li bh_datalen
746The length of the packet off the wire.
747This value is independent of the truncation amount specified by the filter.
748.It Li bh_hdrlen
749The length of the
750.Nm
751header, which may not be equal to
752.\" XXX - not really a function call
753.Fn sizeof "struct bpf_xhdr"
754or
755.Fn sizeof "struct bpf_hdr" .
756.El
757.Pp
758The
759.Li bh_hdrlen
760field exists to account for
761padding between the header and the link level protocol.
762The purpose here is to guarantee proper alignment of the packet
763data structures, which is required on alignment sensitive
764architectures and improves performance on many other architectures.
765The packet filter ensures that the
766.Vt bpf_xhdr ,
767.Vt bpf_hdr
768and the network layer
769header will be word aligned.
770Currently,
771.Vt bpf_hdr
772is used when the time stamp is set to
773.Dv BPF_T_MICROTIME ,
774.Dv BPF_T_MICROTIME_FAST ,
775.Dv BPF_T_MICROTIME_MONOTONIC ,
776.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
777or
778.Dv BPF_T_NONE
779for backward compatibility reasons.
780Otherwise,
781.Vt bpf_xhdr
782is used.
783However,
784.Vt bpf_hdr
785may be deprecated in the near future.
786Suitable precautions
787must be taken when accessing the link layer protocol fields on alignment
788restricted machines.
789(This is not a problem on an Ethernet, since
790the type field is a short falling on an even offset,
791and the addresses are probably accessed in a bytewise fashion).
792.Pp
793Additionally, individual packets are padded so that each starts
794on a word boundary.
795This requires that an application
796has some knowledge of how to get from packet to packet.
797The macro
798.Dv BPF_WORDALIGN
799is defined in
800.In net/bpf.h
801to facilitate
802this process.
803It rounds up its argument to the nearest word aligned value (where a word is
804.Dv BPF_ALIGNMENT
805bytes wide).
806.Pp
807For example, if
808.Sq Li p
809points to the start of a packet, this expression
810will advance it to the next packet:
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
812.Pp
813For the alignment mechanisms to work properly, the
814buffer passed to
815.Xr read 2
816must itself be word aligned.
817The
818.Xr malloc 3
819function
820will always return an aligned buffer.
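.Pp
The packet-to-packet stepping described in this section can be sketched on a
synthetic buffer.
Everything named with an
.Li SK_
prefix or a
.Li _sketch
suffix below is hypothetical:
.Dv SK_WORDALIGN
is an assumed equivalent of
.Dv BPF_WORDALIGN
with a word taken to be
.Fn sizeof long
bytes, and
.Vt struct hdr_sketch
only imitates the length fields of
.Vt struct bpf_hdr .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for BPF_ALIGNMENT and BPF_WORDALIGN (assumed definitions). */
#define SK_ALIGNMENT	sizeof(long)
#define SK_WORDALIGN(x)	(((x) + (SK_ALIGNMENT - 1)) & ~(SK_ALIGNMENT - 1))

/* Simplified header carrying the same length fields as struct bpf_hdr. */
struct hdr_sketch {
	int64_t		ts_sec;
	int64_t		ts_usec;
	uint32_t	bh_caplen;
	uint32_t	bh_datalen;
	unsigned short	bh_hdrlen;
};

/*
 * Append one record at offset off and return the offset of the next one.
 * The caller is responsible for providing enough space in buf.
 */
static size_t
put_record(unsigned char *buf, size_t off, const void *data, uint32_t caplen)
{
	struct hdr_sketch h;

	memset(&h, 0, sizeof(h));
	h.bh_caplen = h.bh_datalen = caplen;
	h.bh_hdrlen = (unsigned short)SK_WORDALIGN(sizeof(h));
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + h.bh_hdrlen, data, caplen);
	return (off + SK_WORDALIGN((size_t)h.bh_hdrlen + caplen));
}

/* Visit every record in buf[0..nread); return the count and sum caplen. */
static size_t
walk_packets(const unsigned char *buf, size_t nread, size_t *total)
{
	size_t off = 0, count = 0;

	assert(buf != NULL && total != NULL);
	*total = 0;
	while (off + sizeof(struct hdr_sketch) <= nread) {
		struct hdr_sketch h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.bh_hdrlen == 0 ||
		    off + h.bh_hdrlen + h.bh_caplen > nread)
			break;		/* malformed or incomplete record */
		/* Packet data starts at buf + off + h.bh_hdrlen. */
		*total += h.bh_caplen;
		count++;
		off += SK_WORDALIGN((size_t)h.bh_hdrlen + h.bh_caplen);
	}
	return (count);
}
```

The header is copied out with
.Fn memcpy
so that the sketch does not depend on the alignment of the test buffer; code
reading a real, word-aligned
.Xr read 2
buffer can access the header in place.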
821.Sh FILTER MACHINE
822A filter program is an array of instructions, with all branches forwardly
823directed, terminated by a
824.Em return
825instruction.
826Each instruction performs some action on the pseudo-machine state,
827which consists of an accumulator, index register, scratch memory store,
828and implicit program counter.
829.Pp
830The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
839.Pp
840The
841.Li k
842field is used in different ways by different instructions,
843and the
844.Li jt
845and
846.Li jf
847fields are used as offsets
848by the branch instructions.
849The opcodes are encoded in a semi-hierarchical fashion.
850There are eight classes of instructions:
851.Dv BPF_LD ,
852.Dv BPF_LDX ,
853.Dv BPF_ST ,
854.Dv BPF_STX ,
855.Dv BPF_ALU ,
856.Dv BPF_JMP ,
857.Dv BPF_RET ,
858and
859.Dv BPF_MISC .
860Various other mode and
861operator bits are or'd into the class to give the actual instructions.
862The classes and modes are defined in
863.In net/bpf.h .
864.Pp
865Below are the semantics for each defined
866.Nm
867instruction.
868We use the convention that A is the accumulator, X is the index register,
869P[] packet data, and M[] scratch memory store.
870P[i:n] gives the data at byte offset
871.Dq i
872in the packet,
873interpreted as a word (n=4),
874unsigned halfword (n=2), or unsigned byte (n=1).
875M[i] gives the i'th word in the scratch memory store, which is only
876addressed in word units.
877The memory store is indexed from 0 to
878.Dv BPF_MEMWORDS
879- 1.
880.Li k ,
881.Li jt ,
882and
883.Li jf
884are the corresponding fields in the
885instruction definition.
886.Dq len
887refers to the length of the packet.
888.Bl -tag -width BPF_STXx
889.It Dv BPF_LD
890These instructions copy a value into the accumulator.
891The type of the source operand is specified by an
892.Dq addressing mode
893and can be a constant
894.Pq Dv BPF_IMM ,
895packet data at a fixed offset
896.Pq Dv BPF_ABS ,
897packet data at a variable offset
898.Pq Dv BPF_IND ,
899the packet length
900.Pq Dv BPF_LEN ,
901or a word in the scratch memory store
902.Pq Dv BPF_MEM .
903For
904.Dv BPF_IND
905and
906.Dv BPF_ABS ,
907the data size must be specified as a word
908.Pq Dv BPF_W ,
909halfword
910.Pq Dv BPF_H ,
911or byte
912.Pq Dv BPF_B .
913The semantics of all the recognized
914.Dv BPF_LD
915instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
927.It Dv BPF_LDX
928These instructions load a value into the index register.
929Note that
930the addressing modes are more restrictive than those of the accumulator loads,
931but they include
932.Dv BPF_MSH ,
933a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
940.It Dv BPF_ST
941This instruction stores the accumulator into the scratch memory.
942We do not need an addressing mode since there is only one possibility
943for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
947.It Dv BPF_STX
948This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX			M[k] <- X
.Ed
952.It Dv BPF_ALU
953The alu instructions perform operations between the accumulator and
954index register or constant, and store the result back in the accumulator.
955For binary operations, a source mode is required
956.Dv ( BPF_K
957or
958.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K	A <- A + k
BPF_ALU+BPF_SUB+BPF_K	A <- A - k
BPF_ALU+BPF_MUL+BPF_K	A <- A * k
BPF_ALU+BPF_DIV+BPF_K	A <- A / k
BPF_ALU+BPF_MOD+BPF_K	A <- A % k
BPF_ALU+BPF_AND+BPF_K	A <- A & k
BPF_ALU+BPF_OR+BPF_K	A <- A | k
BPF_ALU+BPF_XOR+BPF_K	A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K	A <- A << k
BPF_ALU+BPF_RSH+BPF_K	A <- A >> k
BPF_ALU+BPF_ADD+BPF_X	A <- A + X
BPF_ALU+BPF_SUB+BPF_X	A <- A - X
BPF_ALU+BPF_MUL+BPF_X	A <- A * X
BPF_ALU+BPF_DIV+BPF_X	A <- A / X
BPF_ALU+BPF_MOD+BPF_X	A <- A % X
BPF_ALU+BPF_AND+BPF_X	A <- A & X
BPF_ALU+BPF_OR+BPF_X	A <- A | X
BPF_ALU+BPF_XOR+BPF_X	A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X	A <- A << X
BPF_ALU+BPF_RSH+BPF_X	A <- A >> X
BPF_ALU+BPF_NEG		A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter the flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32-bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA		pc += k
BPF_JMP+BPF_JGT+BPF_K	pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K	pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K	pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K	pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X	pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X	pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X	pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X	pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A		accept A bytes
BPF_RET+BPF_K		accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX	X <- A
BPF_MISC+BPF_TXA	A <- X
.Ed
.El
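.Pp
The semantics listed above can be illustrated with a small user-space
interpreter.
The following is a minimal sketch, not the kernel implementation:
.Vt struct insn ,
.Fn load ,
and
.Fn run_filter
are hypothetical names, the numeric opcode values are reproduced here as
assumptions so that the sketch builds without
.In net/bpf.h ,
and only a subset of the ALU operations is handled.
Any out-of-range packet or scratch memory access rejects the packet.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for <net/bpf.h>; layout and values are assumptions. */
struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };
enum {
	BPF_LD = 0x00, BPF_LDX = 0x01, BPF_ST = 0x02, BPF_STX = 0x03,
	BPF_ALU = 0x04, BPF_JMP = 0x05, BPF_RET = 0x06, BPF_MISC = 0x07,
	BPF_W = 0x00, BPF_H = 0x08, BPF_B = 0x10,
	BPF_IMM = 0x00, BPF_ABS = 0x20, BPF_IND = 0x40, BPF_MEM = 0x60,
	BPF_LEN = 0x80, BPF_MSH = 0xa0,
	BPF_ADD = 0x00, BPF_SUB = 0x10, BPF_OR = 0x40, BPF_AND = 0x50,
	BPF_NEG = 0x80,
	BPF_JA = 0x00, BPF_JEQ = 0x10, BPF_JGT = 0x20, BPF_JGE = 0x30,
	BPF_JSET = 0x40,
	BPF_K = 0x00, BPF_X = 0x08, BPF_A = 0x10,
	BPF_TAX = 0x00, BPF_TXA = 0x80
};

/* Bounds-checked big-endian load of 1, 2 or 4 bytes: P[off:width]. */
static int
load(const uint8_t *p, size_t len, uint64_t off, unsigned width, uint32_t *out)
{
	uint32_t v = 0;

	if (off > len || len - off < width)
		return (0);
	for (unsigned i = 0; i < width; i++)
		v = (v << 8) | p[off + i];
	*out = v;
	return (1);
}

/* Returns the number of bytes to accept; 0 means the packet is ignored. */
uint32_t
run_filter(const struct insn *prog, size_t n, const uint8_t *p, size_t len)
{
	uint32_t A = 0, X = 0, M[16] = { 0 };

	assert(prog != NULL && p != NULL);
	for (size_t pc = 0; pc < n; pc++) {
		const struct insn *in = &prog[pc];
		uint32_t k = in->k, v = 0;
		uint32_t src = (in->code & BPF_X) ? X : k;
		unsigned width = (in->code & 0x18) == BPF_B ? 1 :
		    (in->code & 0x18) == BPF_H ? 2 : 4;
		int ok = 1, taken = 0;

		switch (in->code & 0x07) {
		case BPF_LD:
		case BPF_LDX:
			switch (in->code & 0xe0) {
			case BPF_IMM: v = k; break;
			case BPF_LEN: v = (uint32_t)len; break;
			case BPF_MEM: ok = k < 16; v = ok ? M[k] : 0; break;
			case BPF_ABS: ok = load(p, len, k, width, &v); break;
			case BPF_IND:
				ok = load(p, len, (uint64_t)X + k, width, &v);
				break;
			case BPF_MSH:
				ok = load(p, len, k, 1, &v);
				v = 4 * (v & 0xf);
				break;
			default: ok = 0; break;
			}
			if (!ok)
				return (0);
			if ((in->code & 0x07) == BPF_LD)
				A = v;
			else
				X = v;
			break;
		case BPF_ST:
		case BPF_STX:
			if (k >= 16)
				return (0);
			M[k] = (in->code & 0x07) == BPF_ST ? A : X;
			break;
		case BPF_ALU:
			switch (in->code & 0xf0) {
			case BPF_ADD: A += src; break;
			case BPF_SUB: A -= src; break;
			case BPF_OR:  A |= src; break;
			case BPF_AND: A &= src; break;
			case BPF_NEG: A = 0 - A; break;
			default: return (0);	/* not covered by this sketch */
			}
			break;
		case BPF_JMP:
			switch (in->code & 0xf0) {
			case BPF_JA: pc += k; continue;
			case BPF_JEQ: taken = A == src; break;
			case BPF_JGT: taken = A > src; break;
			case BPF_JGE: taken = A >= src; break;
			case BPF_JSET: taken = (A & src) != 0; break;
			default: return (0);
			}
			pc += taken ? in->jt : in->jf;
			break;
		case BPF_RET:
			return ((in->code & 0x18) == BPF_A ? A : k);
		case BPF_MISC:
			if ((in->code & 0xf8) == BPF_TXA)
				A = X;
			else
				X = A;
			break;
		}
	}
	return (0);	/* ran off the end without a BPF_RET */
}
```
.Ed
.Pp
Jump offsets are added to a program counter that already points at the
following instruction, which is why the loop increment is applied after
every jump.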
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
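.Pp
As an illustration of what these macros produce, the sketch below defines
local stand-ins for
.Vt struct bpf_insn
and the two macros and uses them in an array initializer.
The structure layout, the macro bodies and the numeric opcode values are
assumptions reproduced so that the fragment builds on its own; real
programs should include
.In net/bpf.h
instead.
The array name
.Va accept_ip
and the EtherType value 0x0800 are illustrative.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: opcode, true offset, false offset, operand. */
struct bpf_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};
static_assert(sizeof(struct bpf_insn) == 8, "unexpected padding");

/* Assumed macro bodies: each expands to one brace-enclosed initializer. */
#define BPF_STMT(code, k)         { (uint16_t)(code), 0, 0, (k) }
#define BPF_JUMP(code, k, jt, jf) { (uint16_t)(code), (jt), (jf), (k) }

/* Assumed opcode values, limited to those used below. */
#define BPF_LD  0x00
#define BPF_JMP 0x05
#define BPF_RET 0x06
#define BPF_H   0x08
#define BPF_ABS 0x20
#define BPF_JEQ 0x10
#define BPF_K   0x00

/* Accept every IPv4 frame in full and ignore everything else. */
const struct bpf_insn accept_ip[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),		/* A <- EtherType */
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x0800, 0, 1),	/* IPv4? */
	BPF_STMT(BPF_RET+BPF_K, (uint32_t)-1),		/* yes: accept all */
	BPF_STMT(BPF_RET+BPF_K, 0),			/* no: ignore */
};
const size_t accept_ip_len = sizeof(accept_ip) / sizeof(accept_ip[0]);
```
.Ed
.Pp
.Fn BPF_STMT
leaves both jump offsets at zero, while
.Fn BPF_JUMP
fills them in from its last two arguments.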
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, and DHCP relays are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to a write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_set_filter .
This removes the performance degradation that such writers would otherwise
cause on high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy buffers to be used with BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
		 sizeof(struct ether_header)),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
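.Pp
To make the control flow of this filter easier to follow, the same
decision can be written as ordinary C.
This is a sketch for illustration only:
.Fn load_half
and
.Fn revarp_request_filter
are hypothetical names, and the constants (EtherType 0x8035, request
opcode 3, a 14-byte Ethernet header and a 28-byte ARP body) are assumed
values for the symbols used in the filter.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for the symbolic constants used by the filter. */
enum {
	ETHERTYPE_REVARP = 0x8035,
	REVARP_REQUEST   = 3,
	ACCEPT_LEN       = 14 + 28	/* ether_header + ether_arp */
};

/* Bounds-checked big-endian 16-bit read, like BPF_LD+BPF_H+BPF_ABS. */
static int
load_half(const uint8_t *pkt, size_t len, size_t off, uint16_t *out)
{
	if (len < 2 || off > len - 2)
		return (0);
	*out = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
	return (1);
}

/* Returns the number of bytes to accept, or 0 to ignore the packet. */
uint32_t
revarp_request_filter(const uint8_t *pkt, size_t len)
{
	uint16_t v;

	assert(pkt != NULL);
	/* Offset 12: EtherType.  A mismatch jumps to the final "ret 0". */
	if (!load_half(pkt, len, 12, &v) || v != ETHERTYPE_REVARP)
		return (0);
	/* Offset 20: ARP operation field. */
	if (!load_half(pkt, len, 20, &v) || v != REVARP_REQUEST)
		return (0);
	return (ACCEPT_LEN);
}
```
.Ed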
.Pp
This filter accepts only IP packets between hosts 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
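.Pp
The constants 0x8003700f and 0x80037023 are the two addresses as 32-bit
big-endian words, which is what a
.Dv BPF_W
load yields.
The sketch below shows that derivation and restates the filter as plain C.
The names
.Fn ipv4_k ,
.Fn be32 ,
and
.Fn host_pair_filter
are hypothetical, and the EtherType value 0x0800 is an assumed value for
.Dv ETHERTYPE_IP .
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a dotted quad into the 32-bit value a BPF_W load would yield. */
static uint32_t
ipv4_k(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (((uint32_t)a << 24) | ((uint32_t)b << 16) |
	    ((uint32_t)c << 8) | (uint32_t)d);
}

/* Big-endian 32-bit read; the caller has already checked the bounds. */
static uint32_t
be32(const uint8_t *p)
{
	return (ipv4_k(p[0], p[1], p[2], p[3]));
}

/* Returns (u_int)-1 to accept the whole packet, or 0 to ignore it. */
uint32_t
host_pair_filter(const uint8_t *pkt, size_t len)
{
	const uint32_t h1 = ipv4_k(128, 3, 112, 15);
	const uint32_t h2 = ipv4_k(128, 3, 112, 35);
	uint32_t src, dst;

	assert(pkt != NULL);
	/*
	 * Every accepting path loads the destination word at 30..33, so
	 * a shorter packet can never be accepted.
	 */
	if (len < 34)
		return (0);
	if (pkt[12] != 0x08 || pkt[13] != 0x00)	/* assumed ETHERTYPE_IP */
		return (0);
	src = be32(pkt + 26);
	dst = be32(pkt + 30);
	if ((src == h1 && dst == h2) || (src == h2 && dst == h1))
		return (UINT32_MAX);
	return (0);
}
```
.Ed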
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
	BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
	BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
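.Pp
The role of the index register in this filter may be easier to see in
plain C.
In the sketch below,
.Va x
plays the part of X after the
.Dv BPF_MSH
load, so the TCP ports are read at offsets relative to the variable-length
IP header.
The names
.Fn load_half
and
.Fn finger_filter
are hypothetical, and 0x0800 and 6 are assumed values for
.Dv ETHERTYPE_IP
and
.Dv IPPROTO_TCP .
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds-checked big-endian 16-bit read. */
static int
load_half(const uint8_t *pkt, size_t len, size_t off, uint16_t *out)
{
	if (len < 2 || off > len - 2)
		return (0);
	*out = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
	return (1);
}

/* Returns (u_int)-1 to accept the whole packet, or 0 to ignore it. */
uint32_t
finger_filter(const uint8_t *pkt, size_t len)
{
	uint16_t v;
	size_t x;

	assert(pkt != NULL);
	if (!load_half(pkt, len, 12, &v) || v != 0x0800)  /* EtherType */
		return (0);
	if (len <= 23 || pkt[23] != 6)			  /* IP protocol */
		return (0);
	/* BPF_JSET: any bit of the fragment offset set means no TCP header. */
	if (!load_half(pkt, len, 20, &v) || (v & 0x1fff) != 0)
		return (0);
	/* BPF_MSH: X <- 4*(P[14:1]&0xf), the IP header length in bytes. */
	x = 4 * (size_t)(pkt[14] & 0x0f);
	/* BPF_IND loads: source port at X+14, destination port at X+16. */
	if (load_half(pkt, len, x + 14, &v) && v == 79)
		return (UINT32_MAX);
	if (load_half(pkt, len, x + 16, &v) && v == 79)
		return (UINT32_MAX);
	return (0);
}
```
.Ed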
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.