.\" Copyright (c) 2007 Seccuris Inc.
.\" All rights reserved.
.\"
.\" This software was developed by Robert N. M. Watson under contract to
.\" Seccuris Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that: (1) source code distributions
.\" retain the above copyright notice and this paragraph in its entirety, (2)
.\" distributions including binary code include the above copyright notice and
.\" this paragraph in its entirety in the documentation or other materials
.\" provided with the distribution, and (3) all advertising materials mentioning
.\" features or use of this software display the following acknowledgement:
.\" ``This product includes software developed by the University of California,
.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
.\" the University nor the names of its contributors may be used to endorse
.\" or promote products derived from this software without specific prior
.\" written permission.
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd July 22, 2021
.Dt BPF 4
.Os
.Sh NAME
.Nm bpf
.Nd Berkeley Packet Filter
.Sh SYNOPSIS
.Cd device bpf
.Sh DESCRIPTION
The Berkeley Packet Filter
provides a raw interface to data link layers in a protocol
independent fashion.
All packets on the network, even those destined for other hosts,
are accessible through this mechanism.
.Pp
The packet filter appears as a character special device,
.Pa /dev/bpf .
After opening the device, the file descriptor must be bound to a
specific network interface with the
.Dv BIOCSETIF
ioctl.
A given interface can be shared by multiple listeners, and the filter
underlying each descriptor will see an identical packet stream.
.Pp
Associated with each open instance of a
.Nm
file is a user-settable packet filter.
Whenever a packet is received by an interface,
all file descriptors listening on that interface apply their filter.
Each descriptor that accepts the packet receives its own copy.
.Pp
A packet can be sent out on the network by writing to a
.Nm
file descriptor.
The writes are unbuffered, meaning only one packet can be processed per write.
Currently, only writes to Ethernets and
.Tn SLIP
links are supported.
.Sh BUFFER MODES
.Nm
devices deliver packet data to the application via memory buffers provided by
the application.
The buffer mode is set using the
.Dv BIOCSETBUFMODE
ioctl, and read using the
.Dv BIOCGETBUFMODE
ioctl.
.Ss Buffered read mode
By default,
.Nm
devices operate in the
.Dv BPF_BUFMODE_BUFFER
mode, in which packet data is copied explicitly from kernel to user memory
using the
.Xr read 2
system call.
The user process will declare a fixed buffer size that will be used both for
sizing internal buffers and for all
.Xr read 2
operations on the file.
This size is queried using the
.Dv BIOCGBLEN
ioctl, and is set using the
.Dv BIOCSBLEN
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
.Ss Zero-copy buffer mode
.Nm
devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
mode, in which packet data is written directly into two user memory buffers
by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
the page size.
The maximum zero-copy buffer size is returned by the
.Dv BIOCGETZMAX
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
.Pp
The user process registers two memory buffers using the
.Dv BIOCSETZBUF
ioctl, which accepts a
.Vt struct bpf_zbuf
pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
.Pp
.Vt bz_bufa
is a pointer to the userspace address of the first buffer that will be
filled, and
.Vt bz_bufb
is a pointer to the second buffer.
.Nm
will then cycle between the two buffers as they fill and are acknowledged.
.Pp
Each buffer begins with a fixed-length header to hold synchronization and
data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int  bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int  bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int  bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
.Pp
The header structure of each buffer, including all padding, should be zeroed
before it is configured using
.Dv BIOCSETZBUF .
Remaining space in the buffer will be used by the kernel to store packet
data, laid out in the same format as with buffered read mode.
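.Pp
As a minimal sketch of how a process might prepare such buffers, two anonymous
.Xr mmap 2
regions meet the alignment, size, and zeroed-header requirements at once.
The helper name
.Fn alloc_zbuf_pair
and its parameters are hypothetical and not part of the
.Nm
interface; the
.Fa zmax
argument is assumed to hold the value obtained from
.Dv BIOCGETZMAX .
.Bd -literal
#include <sys/mman.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Round "want" up to whole pages, clamp it to "zmax" (the limit that
 * BIOCGETZMAX would report), and map two zero-filled regions of that
 * size.  Returns the per-buffer length, or 0 on failure.
 */
static size_t
alloc_zbuf_pair(size_t want, size_t zmax, void **bufa, void **bufb)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = (want + page - 1) / page * page;

	if (len > zmax)
		len = zmax / page * page;
	if (len == 0)
		return (0);
	/* Anonymous mappings are page-aligned and start out zeroed. */
	*bufa = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (*bufa == MAP_FAILED)
		return (0);
	*bufb = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (*bufb == MAP_FAILED) {
		munmap(*bufa, len);
		return (0);
	}
	return (len);
}
.Ed
.Pp
The returned length and the two addresses would then be stored in the
.Vt bz_buflen ,
.Vt bz_bufa ,
and
.Vt bz_bufb
fields of a
.Vt struct bpf_zbuf
and passed to
.Dv BIOCSETZBUF .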
.Pp
The kernel and the user process follow a simple acknowledgement protocol via
the buffer header to synchronize access to the buffer: when the header
generation numbers,
.Vt bzh_kernel_gen
and
.Vt bzh_user_gen ,
hold the same value, the kernel owns the buffer, and when they differ,
userspace owns the buffer.
.Pp
While the kernel owns the buffer, the contents are unstable and may change
asynchronously; while the user process owns the buffer, its contents are
stable and will not be changed until the buffer has been acknowledged.
.Pp
Initializing the buffer headers to all 0's before registering the buffer has
the effect of assigning initial ownership of both buffers to the kernel.
The kernel signals that a buffer has been assigned to userspace by modifying
.Vt bzh_kernel_gen ,
and userspace acknowledges the buffer and returns it to the kernel by setting
the value of
.Vt bzh_user_gen
to the value of
.Vt bzh_kernel_gen .
.Pp
In order to avoid caching and memory re-ordering effects, the user process
must use atomic operations and memory barriers when checking for and
acknowledging buffers:
.Bd -literal
#include <sys/types.h>
#include <sys/time.h>
#include <machine/atomic.h>
#include <net/bpf.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
.Pp
The user process may force the assignment of the next buffer, if any data
is pending, to userspace using the
.Dv BIOCROTZBUF
ioctl.
This allows the user process to retrieve data in a partially filled buffer
before the buffer is full, such as following a timeout; the process must
recheck for buffer ownership using the header generation numbers, as the
buffer will not be assigned to userspace if no data was present.
.Pp
As in the buffered read mode,
.Xr kqueue 2 ,
.Xr poll 2 ,
and
.Xr select 2
may be used to sleep awaiting the availability of a completed buffer.
They will return a readable file descriptor when ownership of the next buffer
is assigned to userspace.
.Pp
In the current implementation, the kernel may assign zero, one, or both
buffers to the user process; however, an earlier implementation maintained
the invariant that at most one buffer could be assigned to the user process
at a time.
To ensure both progress and high performance, user processes should
acknowledge a completely processed buffer as quickly as possible, returning
it for reuse, and not block waiting on a second buffer while holding another
buffer.
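.Pp
The consumer side of this protocol can be sketched portably as follows.
This is an illustration only: it substitutes C11
.In stdatomic.h
operations for the
.In machine/atomic.h
primitives, and
.Vt struct ex_zbuf_header
and
.Fn drain_owned
are hypothetical stand-ins rather than part of the
.Nm
interface.
.Bd -literal
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for struct bpf_zbuf_header, using C11 atomics. */
struct ex_zbuf_header {
	atomic_uint kernel_gen;	/* role of bzh_kernel_gen */
	atomic_uint kernel_len;	/* role of bzh_kernel_len */
	atomic_uint user_gen;	/* role of bzh_user_gen */
};

/*
 * Visit both buffers once.  For each one owned by userspace, add its
 * data length to the total and hand it straight back to the kernel.
 */
static size_t
drain_owned(struct ex_zbuf_header *hdr[2])
{
	size_t total = 0;

	for (int i = 0; i < 2; i++) {
		unsigned int kgen = atomic_load_explicit(
		    &hdr[i]->kernel_gen, memory_order_acquire);

		if (kgen == atomic_load_explicit(&hdr[i]->user_gen,
		    memory_order_relaxed))
			continue;	/* generations match: kernel's */
		total += atomic_load_explicit(&hdr[i]->kernel_len,
		    memory_order_relaxed);
		/* ...process the packet data before acknowledging... */
		atomic_store_explicit(&hdr[i]->user_gen, kgen,
		    memory_order_release);
	}
	return (total);
}
.Ed
.Pp
Because each owned buffer is acknowledged as soon as it has been processed,
the sketch never holds one buffer while waiting on the other.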
.Sh IOCTLS
The
.Xr ioctl 2
command codes below are defined in
.In net/bpf.h .
All commands require
these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
.Pp
Additionally,
.Dv BIOCGETIF
and
.Dv BIOCSETIF
require
.In sys/socket.h
and
.In net/if.h .
.Pp
In addition to
.Dv FIONREAD
the following commands may be applied to any open
.Nm
file.
The (third) argument to
.Xr ioctl 2
should be a pointer to the type indicated.
.Bl -tag -width BIOCGETBUFMODE
.It Dv BIOCGBLEN
.Pq Li u_int
Returns the required buffer length for reads on
.Nm
files.
.It Dv BIOCSBLEN
.Pq Li u_int
Sets the buffer length for reads on
.Nm
files.
The buffer must be set before the file is attached to an interface
with
.Dv BIOCSETIF .
If the requested buffer size cannot be accommodated, the closest
allowable size will be set and returned in the argument.
A read call will result in
.Er EINVAL
if it is passed a buffer that is not this size.
.It Dv BIOCGDLT
.Pq Li u_int
Returns the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified.
The device types, prefixed with
.Dq Li DLT_ ,
are defined in
.In net/bpf.h .
.It Dv BIOCGDLTLIST
.Pq Li "struct bpf_dltlist"
Returns an array of the available types of the data link layer
underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
.Pp
The available types are returned in the array pointed to by the
.Va bfl_list
field while their length in u_int is supplied to the
.Va bfl_len
field.
.Er ENOMEM
is returned if there is not enough buffer space and
.Er EFAULT
is returned if a bad address is encountered.
The
.Va bfl_len
field is modified on return to indicate the actual length in u_int
of the array returned.
If
.Va bfl_list
is
.Dv NULL ,
the
.Va bfl_len
field is set to indicate the required length of an array in u_int.
.It Dv BIOCSDLT
.Pq Li u_int
Changes the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified or the specified
type is not available for the interface.
.It Dv BIOCPROMISC
Forces the interface into promiscuous mode.
All packets, not just those destined for the local host, are processed.
Since more than one file can be listening on a given interface,
a listener that opened its interface non-promiscuously may receive
packets promiscuously.
This problem can be remedied with an appropriate filter.
.Pp
The interface remains in promiscuous mode until all files listening
promiscuously are closed.
.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by
.Dv BIOCGSTATS .
.It Dv BIOCGETIF
.Pq Li "struct ifreq"
Returns the name of the hardware interface that the file is listening on.
The name is returned in the
.Li ifr_name
field of the
.Li ifreq
structure.
All other fields are undefined.
.It Dv BIOCSETIF
.Pq Li "struct ifreq"
Sets the hardware interface associated with the file.
This
command must be performed before any packets can be read.
The device is indicated by name using the
.Li ifr_name
field of the
.Li ifreq
structure.
Additionally, performs the actions of
.Dv BIOCFLUSH .
.It Dv BIOCSRTIMEOUT
.It Dv BIOCGRTIMEOUT
.Pq Li "struct timeval"
Sets or gets the read timeout parameter.
The argument
specifies the length of time to wait before timing
out on a read request.
This parameter is initialized to zero by
.Xr open 2 ,
indicating no timeout.
.It Dv BIOCGSTATS
.Pq Li "struct bpf_stat"
Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
.Pp
The fields are:
.Bl -hang -offset indent
.It Li bs_recv
the number of packets received by the descriptor since opened or reset
(including any buffered since the last read call);
and
.It Li bs_drop
the number of packets which were accepted by the filter but dropped by the
kernel because of buffer overflows
(i.e., the application's reads are not keeping up with the packet traffic).
.El
.It Dv BIOCIMMEDIATE
.Pq Li u_int
Enables or disables
.Dq immediate mode ,
based on the truth value of the argument.
When immediate mode is enabled, reads return immediately upon packet
reception.
Otherwise, a read will block until either the kernel buffer
becomes full or a timeout occurs.
Immediate mode is useful for programs like
.Xr rarpd 8
which must respond to messages in real time.
The default for a new file is off.
.It Dv BIOCSETF
.It Dv BIOCSETFNR
.Pq Li "struct bpf_program"
Sets the read filter program used by the kernel to discard uninteresting
packets.
An array of instructions and its length is passed in using
the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
.Pp
The filter program is pointed to by the
.Li bf_insns
field while its length in units of
.Sq Li struct bpf_insn
is given by the
.Li bf_len
field.
See section
.Sx "FILTER MACHINE"
for an explanation of the filter language.
The only difference between
.Dv BIOCSETF
and
.Dv BIOCSETFNR
is that
.Dv BIOCSETF
performs the actions of
.Dv BIOCFLUSH
while
.Dv BIOCSETFNR
does not.
.It Dv BIOCSETWF
.Pq Li "struct bpf_program"
Sets the write filter program used by the kernel to control what type of
packets can be written to the interface.
See the
.Dv BIOCSETF
command for more
information on the
.Nm
filter program.
.It Dv BIOCVERSION
.Pq Li "struct bpf_version"
Returns the major and minor version numbers of the filter language currently
recognized by the kernel.
Before installing a filter, applications must check
that the current version is compatible with the running kernel.
Version numbers are compatible if the major numbers match and the application minor
is less than or equal to the kernel minor.
The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
        u_short bv_major;
        u_short bv_minor;
};
.Ed
.Pp
The current version numbers are given by
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION
from
.In net/bpf.h .
An incompatible filter
may result in undefined behavior (most likely, an error returned by
.Fn ioctl
or haphazard packet matching).
.It Dv BIOCGRSIG
.It Dv BIOCSRSIG
.Pq Li u_int
Sets or gets the receive signal.
This signal will be sent to the process or process group specified by
.Dv FIOSETOWN .
It defaults to
.Dv SIGIO .
.It Dv BIOCSHDRCMPLT
.It Dv BIOCGHDRCMPLT
.Pq Li u_int
Sets or gets the status of the
.Dq header complete
flag.
Set to zero if the link level source address should be filled in automatically
by the interface output routine.
Set to one if the link level source
address will be written, as provided, to the wire.
This flag is initialized to zero by default.
.It Dv BIOCSSEESENT
.It Dv BIOCGSEESENT
.Pq Li u_int
These commands are obsolete but left for compatibility.
Use
.Dv BIOCSDIRECTION
and
.Dv BIOCGDIRECTION
instead.
Sets or gets the flag determining whether locally generated packets on the
interface should be returned by BPF.
Set to zero to see only incoming packets on the interface.
Set to one to see packets originating locally and remotely on the interface.
This flag is initialized to one by default.
.It Dv BIOCSDIRECTION
.It Dv BIOCGDIRECTION
.Pq Li u_int
Sets or gets the setting determining whether incoming, outgoing, or all packets
on the interface should be returned by BPF.
Set to
.Dv BPF_D_IN
to see only incoming packets on the interface.
Set to
.Dv BPF_D_INOUT
to see packets originating locally and remotely on the interface.
Set to
.Dv BPF_D_OUT
to see only outgoing packets on the interface.
This setting is initialized to
.Dv BPF_D_INOUT
by default.
.It Dv BIOCSTSTAMP
.It Dv BIOCGTSTAMP
.Pq Li u_int
Set or get format and resolution of the time stamps returned by BPF.
Set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
or
.Dv BPF_T_MICROTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timeval
format.
Set to
.Dv BPF_T_NANOTIME ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
or
.Dv BPF_T_NANOTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timespec
format.
Set to
.Dv BPF_T_BINTIME ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
or
.Dv BPF_T_BINTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct bintime
format.
Set to
.Dv BPF_T_NONE
to ignore time stamp.
All 64-bit time stamp formats are wrapped in
.Vt struct bpf_ts .
The
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
are analogs of corresponding formats without _FAST suffix but do not perform
a full time counter query, so their accuracy is one timer tick.
The
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
.Dv BPF_T_BINTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
store the time elapsed since kernel boot.
This setting is initialized to
.Dv BPF_T_MICROTIME
by default.
.It Dv BIOCFEEDBACK
.Pq Li u_int
Set packet feedback mode.
This allows injected packets to be fed back as input to the interface when
output via the interface is successful.
When the
.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF, to avoid
duplication.
This flag is initialized to zero by default.
.It Dv BIOCLOCK
Set the locked flag on the
.Nm
descriptor.
This prevents the execution of
ioctl commands which could change the underlying operating parameters of
the device.
.It Dv BIOCGETBUFMODE
.It Dv BIOCSETBUFMODE
.Pq Li u_int
Get or set the current
.Nm
buffering mode; possible values are
.Dv BPF_BUFMODE_BUFFER ,
buffered read mode, and
.Dv BPF_BUFMODE_ZBUF ,
zero-copy buffer mode.
.It Dv BIOCSETZBUF
.Pq Li "struct bpf_zbuf"
Set the current zero-copy buffer locations; buffer locations may be
set only once zero-copy buffer mode has been selected, and prior to attaching
to an interface.
Buffers must be of identical size, page-aligned, and an integer multiple of
pages in size.
The three fields
.Vt bz_bufa ,
.Vt bz_bufb ,
and
.Vt bz_buflen
must be filled out.
If buffers have already been set for this device, the ioctl will fail.
.It Dv BIOCGETZMAX
.Pq Li size_t
Get the largest individual zero-copy buffer size allowed.
As two buffers are used in zero-copy buffer mode, the limit (in practice) is
twice the returned size.
As zero-copy buffers consume kernel address space, conservative selection of
buffer size is suggested, especially when there are multiple
.Nm
descriptors in use on 32-bit systems.
.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
If no data is present, the buffer will remain owned by the kernel.
This allows consumers of zero-copy buffering to implement timeouts and
retrieve partially filled buffers.
In order to handle the case where no data is present in the buffer and
therefore ownership is not assigned, the user process must check
.Vt bzh_kernel_gen
against
.Vt bzh_user_gen .
.It Dv BIOCSETVLANPCP
Set the VLAN PCP bits to the supplied value.
.El
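.Pp
The compatibility rule given under
.Dv BIOCVERSION
can be written as a small predicate.
This is a sketch, not part of the
.Nm
interface: the name
.Fn version_compatible
is hypothetical, and the caller is assumed to pass
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION
as the application values and the
.Vt bv_major
and
.Vt bv_minor
fields returned by the kernel as the kernel values.
.Bd -literal
/*
 * Return non-zero if a filter built against version
 * app_major.app_minor may be installed on a kernel that reports
 * k_major.k_minor: the majors must match and the application minor
 * must not exceed the kernel minor.
 */
static int
version_compatible(unsigned int app_major, unsigned int app_minor,
    unsigned int k_major, unsigned int k_minor)
{
	return (app_major == k_major && app_minor <= k_minor);
}
.Ed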
.Sh STANDARD IOCTLS
.Nm
now supports several standard
.Xr ioctl 2 Ns 's
which allow the user to do async and/or non-blocking I/O to an open
.Nm
file descriptor.
.Bl -tag -width SIOCGIFADDR
.It Dv FIONREAD
.Pq Li int
Returns the number of bytes that are immediately available for reading.
.It Dv SIOCGIFADDR
.Pq Li "struct ifreq"
Returns the address associated with the interface.
.It Dv FIONBIO
.Pq Li int
Sets or clears non-blocking I/O.
If arg is non-zero, then doing a
.Xr read 2
when no data is available will return -1 and
.Va errno
will be set to
.Er EAGAIN .
If arg is zero, non-blocking I/O is disabled.
Note: setting this overrides the timeout set by
.Dv BIOCSRTIMEOUT .
.It Dv FIOASYNC
.Pq Li int
Enables or disables async I/O.
When enabled (arg is non-zero), the process or process group specified by
.Dv FIOSETOWN
will start receiving
.Dv SIGIO 's
when packets arrive.
Note that you must do an
.Dv FIOSETOWN
in order for this to take effect,
as the system will not default this for you.
The signal may be changed via
.Dv BIOCSRSIG .
.It Dv FIOSETOWN
.It Dv FIOGETOWN
.Pq Li int
Sets or gets the process or process group (if negative) that should
receive
.Dv SIGIO
when packets are available.
The signal may be changed using
.Dv BIOCSRSIG
(see above).
.El
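.Pp
The
.Dv FIONBIO
behaviour described above can be sketched as follows.
The helper names
.Fn set_nonblocking
and
.Fn read_would_block
are hypothetical; the calls are generic descriptor operations and are shown
here without any
.Nm
specific setup.
.Bd -literal
#include <sys/types.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <unistd.h>

/* Switch non-blocking I/O on (on != 0) or off (on == 0) for fd. */
static int
set_nonblocking(int fd, int on)
{
	return (ioctl(fd, FIONBIO, &on));
}

/*
 * Try one read.  Return 1 if no data was ready (EAGAIN), 0 if the
 * read completed, and -1 on any other error.
 */
static int
read_would_block(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	if (n >= 0)
		return (0);
	return (errno == EAGAIN ? 1 : -1);
}
.Ed
.Pp
On a
.Nm
descriptor in buffered read mode, the buffer passed to the read would
additionally need to be of the size reported by
.Dv BIOCGBLEN .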
.Sh BPF HEADER
One of the following structures is prepended to each packet returned by
.Xr read 2
or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};
.Ed
.Pp
The fields, whose values are stored in host order, are:
.Pp
.Bl -tag -compact -width bh_datalen
.It Li bh_tstamp
The time at which the packet was processed by the packet filter.
.It Li bh_caplen
The length of the captured portion of the packet.
This is the minimum of
the truncation amount specified by the filter and the length of the packet.
.It Li bh_datalen
The length of the packet off the wire.
This value is independent of the truncation amount specified by the filter.
.It Li bh_hdrlen
The length of the
.Nm
header, which may not be equal to
.\" XXX - not really a function call
.Fn sizeof "struct bpf_xhdr"
or
.Fn sizeof "struct bpf_hdr" .
.El
.Pp
The
.Li bh_hdrlen
field exists to account for
padding between the header and the link level protocol.
The purpose here is to guarantee proper alignment of the packet
data structures, which is required on alignment sensitive
architectures and improves performance on many other architectures.
The packet filter ensures that the
.Vt bpf_xhdr ,
.Vt bpf_hdr
and the network layer
header will be word aligned.
Currently,
.Vt bpf_hdr
is used when the time stamp is set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
or
.Dv BPF_T_NONE
for backward compatibility reasons.
Otherwise,
.Vt bpf_xhdr
is used.
However,
.Vt bpf_hdr
may be deprecated in the near future.
Suitable precautions
must be taken when accessing the link layer protocol fields on alignment
restricted machines.
(This is not a problem on an Ethernet, since
the type field is a short falling on an even offset,
and the addresses are probably accessed in a bytewise fashion.)
.Pp
Additionally, individual packets are padded so that each starts
on a word boundary.
This requires that an application
has some knowledge of how to get from packet to packet.
The macro
.Dv BPF_WORDALIGN
is defined in
.In net/bpf.h
to facilitate
this process.
It rounds up its argument to the nearest word aligned value (where a word is
.Dv BPF_ALIGNMENT
bytes wide).
.Pp
For example, if
.Sq Li p
points to the start of a packet, this expression
will advance it to the next packet:
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
.Pp
For the alignment mechanisms to work properly, the
buffer passed to
.Xr read 2
must itself be word aligned.
The
.Xr malloc 3
function
will always return an aligned buffer.
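.Pp
The traversal described above can be sketched as a loop over a buffer filled by
.Xr read 2 .
This is a minimal, portable illustration rather than code taken from
.Nm :
.Vt struct ex_hdr ,
.Dv EX_WORDALIGN ,
and
.Fn count_packets
are hypothetical stand-ins for
.Vt struct bpf_hdr ,
.Dv BPF_WORDALIGN ,
and application code; a real program would use the definitions from
.In net/bpf.h .
.Bd -literal
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct bpf_hdr: only the fields this sketch needs. */
struct ex_hdr {
	uint32_t caplen;	/* role of bh_caplen */
	uint32_t datalen;	/* role of bh_datalen */
	uint16_t hdrlen;	/* role of bh_hdrlen */
};

/* Stand-ins for BPF_ALIGNMENT and BPF_WORDALIGN. */
#define EX_ALIGNMENT	sizeof(long)
#define EX_WORDALIGN(x)	(((x) + (EX_ALIGNMENT - 1)) & ~(EX_ALIGNMENT - 1))

/*
 * Count the complete packets in buf[0..len) by hopping from one
 * header to the next.  Stops at the first record that does not fit.
 */
static size_t
count_packets(const unsigned char *buf, size_t len)
{
	size_t off = 0, n = 0;

	while (off + sizeof(struct ex_hdr) <= len) {
		struct ex_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.hdrlen < sizeof(h) ||
		    off + h.hdrlen + h.caplen > len)
			break;
		/* Packet data starts at buf + off + h.hdrlen. */
		n++;
		off += EX_WORDALIGN((size_t)h.hdrlen + h.caplen);
	}
	return (n);
}
.Ed
.Pp
The packet data is located using the header length field rather than the
size of the structure, since the two may differ because of padding.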
.Sh FILTER MACHINE
A filter program is an array of instructions, with all branches forwardly
directed, terminated by a
.Em return
instruction.
Each instruction performs some action on the pseudo-machine state,
which consists of an accumulator, index register, scratch memory store,
and implicit program counter.
.Pp
The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
.Pp
The
.Li k
field is used in different ways by different instructions,
and the
.Li jt
and
.Li jf
fields are used as offsets
by the branch instructions.
The opcodes are encoded in a semi-hierarchical fashion.
There are eight classes of instructions:
.Dv BPF_LD ,
.Dv BPF_LDX ,
.Dv BPF_ST ,
.Dv BPF_STX ,
.Dv BPF_ALU ,
.Dv BPF_JMP ,
.Dv BPF_RET ,
and
.Dv BPF_MISC .
Various other mode and
operator bits are or'd into the class to give the actual instructions.
The classes and modes are defined in
.In net/bpf.h .
.Pp
Below are the semantics for each defined
.Nm
instruction.
We use the convention that A is the accumulator, X is the index register,
P[] packet data, and M[] scratch memory store.
P[i:n] gives the data at byte offset
.Dq i
in the packet,
interpreted as a word (n=4),
unsigned halfword (n=2), or unsigned byte (n=1).
M[i] gives the i'th word in the scratch memory store, which is only
addressed in word units.
The memory store is indexed from 0 to
.Dv BPF_MEMWORDS
- 1.
.Li k ,
.Li jt ,
and
.Li jf
are the corresponding fields in the
instruction definition.
.Dq len
refers to the length of the packet.
.Bl -tag -width BPF_STXx
.It Dv BPF_LD
These instructions copy a value into the accumulator.
The type of the source operand is specified by an
.Dq addressing mode
and can be a constant
.Pq Dv BPF_IMM ,
packet data at a fixed offset
.Pq Dv BPF_ABS ,
packet data at a variable offset
.Pq Dv BPF_IND ,
the packet length
.Pq Dv BPF_LEN ,
or a word in the scratch memory store
.Pq Dv BPF_MEM .
For
.Dv BPF_IND
and
.Dv BPF_ABS ,
the data size must be specified as a word
.Pq Dv BPF_W ,
halfword
.Pq Dv BPF_H ,
or byte
.Pq Dv BPF_B .
The semantics of all the recognized
.Dv BPF_LD
instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
.It Dv BPF_LDX
These instructions load a value into the index register.
Note that
the addressing modes are more restrictive than those of the accumulator loads,
but they include
.Dv BPF_MSH ,
a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
.It Dv BPF_ST
This instruction stores the accumulator into the scratch memory.
We do not need an addressing mode since there is only one possibility
for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
.It Dv BPF_STX
This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX			M[k] <- X
.Ed
.It Dv BPF_ALU
The alu instructions perform operations between the accumulator and
index register or constant, and store the result back in the accumulator.
For binary operations, a source mode is required
.Dv ( BPF_K
or
.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K	A <- A + k
BPF_ALU+BPF_SUB+BPF_K	A <- A - k
BPF_ALU+BPF_MUL+BPF_K	A <- A * k
BPF_ALU+BPF_DIV+BPF_K	A <- A / k
BPF_ALU+BPF_MOD+BPF_K	A <- A % k
BPF_ALU+BPF_AND+BPF_K	A <- A & k
BPF_ALU+BPF_OR+BPF_K	A <- A | k
BPF_ALU+BPF_XOR+BPF_K	A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K	A <- A << k
BPF_ALU+BPF_RSH+BPF_K	A <- A >> k
BPF_ALU+BPF_ADD+BPF_X	A <- A + X
BPF_ALU+BPF_SUB+BPF_X	A <- A - X
BPF_ALU+BPF_MUL+BPF_X	A <- A * X
BPF_ALU+BPF_DIV+BPF_X	A <- A / X
BPF_ALU+BPF_MOD+BPF_X	A <- A % X
BPF_ALU+BPF_AND+BPF_X	A <- A & X
BPF_ALU+BPF_OR+BPF_X	A <- A | X
BPF_ALU+BPF_XOR+BPF_X	A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X	A <- A << X
BPF_ALU+BPF_RSH+BPF_X	A <- A >> X
BPF_ALU+BPF_NEG		A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32 bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA		pc += k
BPF_JMP+BPF_JGT+BPF_K	pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K	pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K	pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K	pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X	pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X	pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X	pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X	pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A		accept A bytes
BPF_RET+BPF_K		accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX	X <- A
BPF_MISC+BPF_TXA	A <- X
.Ed
.El
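.Pp
As a rough illustration of the semantics listed above, the following sketch
interprets a subset of the instruction set in user space.
It is not the kernel implementation: it assumes an already validated program,
covers only the load, jump, and return classes, and rejects anything else.
The names insn, pkt_load, run_filter, and rarp_verdict are hypothetical,
and the opcode constants are defined locally with the values used by
<net/bpf.h> so that the sketch builds without that header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode field values as in <net/bpf.h>, defined locally for portability. */
enum {
	BPF_LD = 0x00, BPF_LDX = 0x01, BPF_JMP = 0x05, BPF_RET = 0x06,
	BPF_W = 0x00, BPF_H = 0x08, BPF_B = 0x10,
	BPF_IMM = 0x00, BPF_ABS = 0x20, BPF_IND = 0x40,
	BPF_LEN = 0x80, BPF_MSH = 0xa0,
	BPF_JA = 0x00, BPF_JEQ = 0x10, BPF_JGT = 0x20, BPF_JGE = 0x30,
	BPF_JSET = 0x40, BPF_K = 0x00, BPF_X = 0x08, BPF_A = 0x10
};

struct insn {			/* hypothetical stand-in for struct bpf_insn */
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

/* Read `size` big-endian bytes at `off`; clear *ok when out of bounds. */
static uint32_t
pkt_load(const uint8_t *p, size_t len, uint64_t off, unsigned size, int *ok)
{
	uint32_t v = 0;

	if (off > len || size > len - off) {
		*ok = 0;
		return (0);
	}
	for (unsigned i = 0; i < size; i++)
		v = (v << 8) | p[off + i];
	return (v);
}

/* Run prog over packet p; return the byte count to accept (0 = reject). */
uint32_t
run_filter(const struct insn *prog, size_t n, const uint8_t *p, size_t len)
{
	uint32_t A = 0, X = 0;

	assert(prog != NULL && p != NULL);
	for (size_t pc = 0; pc < n; pc++) {
		const struct insn *in = &prog[pc];
		uint32_t k = in->k, v = 0, src = (in->code & BPF_X) ? X : k;
		unsigned cls = in->code & 0x07;
		unsigned size = (in->code & 0x18) == BPF_W ? 4 :
		    (in->code & 0x18) == BPF_H ? 2 : 1;
		int ok = 1, taken = 0;

		switch (cls) {
		case BPF_LD:
		case BPF_LDX:
			switch (in->code & 0xe0) {	/* addressing mode */
			case BPF_IMM: v = k; break;
			case BPF_LEN: v = (uint32_t)len; break;
			case BPF_ABS: v = pkt_load(p, len, k, size, &ok); break;
			case BPF_IND: v = pkt_load(p, len, (uint64_t)X + k, size, &ok); break;
			case BPF_MSH: v = 4 * (pkt_load(p, len, k, 1, &ok) & 0xf); break;
			default: ok = 0; break;
			}
			if (!ok)
				return (0);	/* bad load: reject the packet */
			if (cls == BPF_LD)
				A = v;
			else
				X = v;
			break;
		case BPF_JMP:
			switch (in->code & 0xf0) {	/* comparisons are unsigned */
			case BPF_JA: pc += k; continue;
			case BPF_JEQ: taken = A == src; break;
			case BPF_JGT: taken = A > src; break;
			case BPF_JGE: taken = A >= src; break;
			case BPF_JSET: taken = (A & src) != 0; break;
			default: return (0);
			}
			pc += taken ? in->jt : in->jf;	/* relative to next insn */
			break;
		case BPF_RET:
			return ((in->code & 0x18) == BPF_A ? A : k);
		default:
			return (0);	/* class not covered by this sketch */
		}
	}
	return (0);			/* ran off the end: reject */
}

/* Reverse ARP request filter: 0x8035 is ETHERTYPE_REVARP, 3 is REVARP_REQUEST. */
uint32_t
rarp_verdict(const uint8_t *p, size_t len)
{
	static const struct insn prog[] = {
		{ BPF_LD + BPF_H + BPF_ABS, 0, 0, 12 },
		{ BPF_JMP + BPF_JEQ + BPF_K, 0, 3, 0x8035 },
		{ BPF_LD + BPF_H + BPF_ABS, 0, 0, 20 },
		{ BPF_JMP + BPF_JEQ + BPF_K, 0, 1, 3 },
		{ BPF_RET + BPF_K, 0, 0, 42 },	/* 14-byte header + 28-byte ARP */
		{ BPF_RET + BPF_K, 0, 0, 0 },
	};

	return (run_filter(prog, sizeof(prog) / sizeof(prog[0]), p, len));
}
```

Because the loop increments pc after every instruction, adding jt, jf, or k
to it yields an offset relative to the following instruction, which is the
convention the jump table above relies on.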
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
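.Pp
To make the role of these macros concrete, the sketch below defines local
equivalents and uses them to build a four-instruction program that accepts
only IPv4 frames.
The struct and the opcode constants are stand-ins declared here so that the
fragment compiles without <net/bpf.h>; a real program would include that
header and use struct bpf_insn instead.
The array name ipv4_only is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct bpf_insn: opcode, two 8-bit jump offsets, operand. */
struct insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

/* Local copies of the opcode fields needed below (values as in <net/bpf.h>). */
enum {
	BPF_LD = 0x00, BPF_JMP = 0x05, BPF_RET = 0x06,
	BPF_H = 0x08, BPF_ABS = 0x20, BPF_JEQ = 0x10, BPF_K = 0x00
};

/* A statement has no branches, so both jump offsets are zero. */
#define BPF_STMT(code, k)		{ (uint16_t)(code), 0, 0, (k) }
/* A jump carries the true and false offsets alongside the operand. */
#define BPF_JUMP(code, k, jt, jf)	{ (uint16_t)(code), (jt), (jf), (k) }

/* Accept whole IPv4 frames (Ethernet type 0x0800), reject everything else. */
const struct insn ipv4_only[] = {
	BPF_STMT(BPF_LD + BPF_H + BPF_ABS, 12),
	BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 0x0800, 0, 1),
	BPF_STMT(BPF_RET + BPF_K, (uint32_t)-1),
	BPF_STMT(BPF_RET + BPF_K, 0),
};

/* Number of instructions, as would be stored in a program length field. */
const size_t ipv4_only_len = sizeof(ipv4_only) / sizeof(ipv4_only[0]);
```

Note the argument order: the operand comes second in both macros, while the
structure stores it last.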
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to a write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_setfilter .
This avoids the performance degradation that such users would otherwise
cause on high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
		 sizeof(struct ether_header)),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
This filter accepts only IP packets between hosts 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
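.Pp
The jump offsets in the host-pair filter can be hard to follow, so here is a
sketch of the same decision written as an ordinary C predicate.
It is only meant to show what the filter computes; the function name
between_hosts and the helpers be16 and be32 are hypothetical, and the value
0x0800 is written out in place of ETHERTYPE_IP.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian 16- and 32-bit reads, matching BPF_H and BPF_W packet loads. */
static uint32_t
be16(const uint8_t *p)
{
	return ((uint32_t)p[0] << 8 | p[1]);
}

static uint32_t
be32(const uint8_t *p)
{
	return (be16(p) << 16 | be16(p + 2));
}

/* Return 1 when the filter would accept the frame, 0 when it would not. */
int
between_hosts(const uint8_t *pkt, size_t len)
{
	const uint32_t a = 0x8003700f;	/* 128.3.112.15 */
	const uint32_t b = 0x80037023;	/* 128.3.112.35 */
	uint32_t src, dst;

	if (len < 34)			/* bytes 0..33 hold both addresses */
		return (0);
	if (be16(pkt + 12) != 0x0800)	/* Ethernet type is not IP */
		return (0);
	src = be32(pkt + 26);		/* IP source address */
	dst = be32(pkt + 30);		/* IP destination address */
	return ((src == a && dst == b) || (src == b && dst == a));
}
```

The filter program reaches the same result with two chains of comparisons,
one for each direction, that share the final accept and reject instructions.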
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
	BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
	BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
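.Pp
The finger filter depends on the variable IP header length, which is what the
BPF_MSH load computes.
The following sketch restates the filter as a plain C predicate to show where
that length enters the port lookups.
The names is_finger and be16 are hypothetical, and the numeric values 0x0800
and 6 stand in for ETHERTYPE_IP and IPPROTO_TCP.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian 16-bit read, matching a BPF_H packet load. */
static uint32_t
be16(const uint8_t *p)
{
	return ((uint32_t)p[0] << 8 | p[1]);
}

/* Return 1 when the filter would accept the frame, 0 when it would not. */
int
is_finger(const uint8_t *pkt, size_t len)
{
	size_t tcp;

	if (len < 24 || be16(pkt + 12) != 0x0800)	/* not an IP frame */
		return (0);
	if (pkt[23] != 6)			/* IP protocol is not TCP */
		return (0);
	if (be16(pkt + 20) & 0x1fff)		/* later fragment: no TCP header */
		return (0);
	/* BPF_MSH: X <- 4*(P[14:1]&0xf), the IP header length in bytes. */
	tcp = 14 + 4 * (size_t)(pkt[14] & 0xf);
	if (len < tcp + 2)			/* source port out of bounds */
		return (0);
	if (be16(pkt + tcp) == 79)		/* source port is finger */
		return (1);
	if (len < tcp + 4)			/* destination port out of bounds */
		return (0);
	return (be16(pkt + tcp + 2) == 79);	/* destination port is finger */
}
```

With a 20-byte IP header the ports sit at offsets 34 and 36; each 4-byte IP
option moves both by four, which is why the filter uses indexed loads instead
of absolute ones.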
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.