.\" Copyright (c) 2007 Seccuris Inc.
.\" All rights reserved.
.\"
.\" This software was developed by Robert N. M. Watson under contract to
.\" Seccuris Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that: (1) source code distributions
.\" retain the above copyright notice and this paragraph in its entirety, (2)
.\" distributions including binary code include the above copyright notice and
.\" this paragraph in its entirety in the documentation or other materials
.\" provided with the distribution, and (3) all advertising materials mentioning
.\" features or use of this software display the following acknowledgement:
.\" ``This product includes software developed by the University of California,
.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
.\" the University nor the names of its contributors may be used to endorse
.\" or promote products derived from this software without specific prior
.\" written permission.
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd July 22, 2021
.Dt BPF 4
.Os
.Sh NAME
.Nm bpf
.Nd Berkeley Packet Filter
.Sh SYNOPSIS
.Cd device bpf
.Sh DESCRIPTION
The Berkeley Packet Filter
provides a raw interface to data link layers in a protocol
independent fashion.
All packets on the network, even those destined for other hosts,
are accessible through this mechanism.
.Pp
The packet filter appears as a character special device,
.Pa /dev/bpf .
After opening the device, the file descriptor must be bound to a
specific network interface with the
.Dv BIOCSETIF
ioctl.
A given interface can be shared by multiple listeners, and the filter
underlying each descriptor will see an identical packet stream.
.Pp
Associated with each open instance of a
.Nm
file is a user-settable packet filter.
Whenever a packet is received by an interface,
all file descriptors listening on that interface apply their filter.
Each descriptor that accepts the packet receives its own copy.
.Pp
A packet can be sent out on the network by writing to a
.Nm
file descriptor.
The writes are unbuffered, meaning only one packet can be processed per write.
Currently, only writes to Ethernets and
.Tn SLIP
links are supported.
.Sh BUFFER MODES
.Nm
devices deliver packet data to the application via memory buffers provided by
the application.
The buffer mode is set using the
.Dv BIOCSETBUFMODE
ioctl, and read using the
.Dv BIOCGETBUFMODE
ioctl.
.Ss Buffered read mode
By default,
.Nm
devices operate in the
.Dv BPF_BUFMODE_BUFFER
mode, in which packet data is copied explicitly from kernel to user memory
using the
.Xr read 2
system call.
The user process will declare a fixed buffer size that will be used both for
sizing internal buffers and for all
.Xr read 2
operations on the file.
This size is queried using the
.Dv BIOCGBLEN
ioctl, and is set using the
.Dv BIOCSBLEN
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
.Ss Zero-copy buffer mode
.Nm
devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
mode, in which packet data is written directly into two user memory buffers
by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an even multiple of
the page size.
The maximum zero-copy buffer size is returned by the
.Dv BIOCGETZMAX
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
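.Pp
The sizing constraints above (a length that is a multiple of the page size
and no larger than the value reported by BIOCGETZMAX) can be sketched as a
small helper.
This is only an illustration: zbuf_round_size is a hypothetical name that
does not exist in net/bpf.h, and the page size and maximum are passed in as
plain arguments rather than obtained from the system, so that the arithmetic
can be read on its own.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of net/bpf.h): choose a zero-copy
 * buffer length for a requested size.  The result is a multiple of
 * pagesize, at least one page, and no larger than zmax, the value a
 * caller would obtain from BIOCGETZMAX.  Returns 0 if no valid
 * length exists.
 */
static size_t
zbuf_round_size(size_t want, size_t pagesize, size_t zmax)
{
	size_t pages;

	if (pagesize == 0 || zmax < pagesize)
		return (0);
	/* Requests above the limit get the largest page multiple allowed. */
	if (want > zmax)
		return ((zmax / pagesize) * pagesize);
	if (want < pagesize)
		want = pagesize;
	/* Round up to a whole number of pages. */
	pages = want / pagesize + (want % pagesize != 0);
	if (pages * pagesize > zmax)
		return ((zmax / pagesize) * pagesize);
	return (pages * pagesize);
}
```

Both buffers registered with BIOCSETZBUF would then be allocated with this
length from a page-aligned allocator of the caller's choice.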
.Pp
The user process registers two memory buffers using the
.Dv BIOCSETZBUF
ioctl, which accepts a
.Vt struct bpf_zbuf
pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
.Pp
.Vt bz_bufa
is a pointer to the userspace address of the first buffer that will be
filled, and
.Vt bz_bufb
is a pointer to the second buffer.
.Nm
will then cycle between the two buffers as they fill and are acknowledged.
.Pp
Each buffer begins with a fixed-length header to hold synchronization and
data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int	bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int	bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int	bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
.Pp
The header structure of each buffer, including all padding, should be zeroed
before it is configured using
.Dv BIOCSETZBUF .
Remaining space in the buffer will be used by the kernel to store packet
data, laid out in the same format as with buffered read mode.
.Pp
The kernel and the user process follow a simple acknowledgement protocol via
the buffer header to synchronize access to the buffer: when the header
generation numbers,
.Vt bzh_kernel_gen
and
.Vt bzh_user_gen ,
hold the same value, the kernel owns the buffer, and when they differ,
userspace owns the buffer.
.Pp
While the kernel owns the buffer, the contents are unstable and may change
asynchronously; while the user process owns the buffer, its contents are
stable and will not be changed until the buffer has been acknowledged.
.Pp
Initializing the buffer headers to all 0's before registering the buffer has
the effect of assigning initial ownership of both buffers to the kernel.
The kernel signals that a buffer has been assigned to userspace by modifying
.Vt bzh_kernel_gen ,
and userspace acknowledges the buffer and returns it to the kernel by setting
the value of
.Vt bzh_user_gen
to the value of
.Vt bzh_kernel_gen .
.Pp
In order to avoid caching and memory re-ordering effects, the user process
must use atomic operations and memory barriers when checking for and
acknowledging buffers:
.Bd -literal
#include <machine/atomic.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
.Pp
The user process may force the assignment of the next buffer, if any data
is pending, to userspace using the
.Dv BIOCROTZBUF
ioctl.
This allows the user process to retrieve data in a partially filled buffer
before the buffer is full, such as following a timeout; the process must
recheck for buffer ownership using the header generation numbers, as the
buffer will not be assigned to userspace if no data was present.
.Pp
As in the buffered read mode,
.Xr kqueue 2 ,
.Xr poll 2 ,
and
.Xr select 2
may be used to sleep awaiting the availability of a completed buffer.
They will return a readable file descriptor when ownership of the next buffer
is assigned to user space.
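.Pp
The ownership transitions of the acknowledgement protocol can be followed
without a BPF device.
The sketch below rests on several assumptions that are not part of the
interface: it mirrors the buffer header in a local type (zbuf_hdr_sketch),
substitutes C11 <stdatomic.h> for the <machine/atomic.h> primitives, and
stands in for the kernel with an ordinary function (sim_kernel_assign).
All of these names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Local stand-in for struct bpf_zbuf_header; padding omitted. */
struct zbuf_hdr_sketch {
	atomic_uint kernel_gen;	/* advanced by the (simulated) kernel */
	atomic_uint kernel_len;	/* bytes of packet data in the buffer */
	atomic_uint user_gen;	/* set by userspace to acknowledge */
};

/* Zero the header: equal generations mean the kernel owns the buffer. */
static void
sketch_init(struct zbuf_hdr_sketch *h)
{
	atomic_init(&h->kernel_gen, 0);
	atomic_init(&h->kernel_len, 0);
	atomic_init(&h->user_gen, 0);
}

/* True when userspace owns the buffer, i.e. the generations differ. */
static bool
sketch_user_owns(struct zbuf_hdr_sketch *h)
{
	unsigned int kgen, ugen;

	kgen = atomic_load_explicit(&h->kernel_gen, memory_order_acquire);
	ugen = atomic_load_explicit(&h->user_gen, memory_order_relaxed);
	return (ugen != kgen);
}

/* Return the buffer to the kernel by matching the generations. */
static void
sketch_acknowledge(struct zbuf_hdr_sketch *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, kgen, memory_order_release);
}

/* Simulated kernel side: publish len bytes, then hand the buffer over. */
static void
sim_kernel_assign(struct zbuf_hdr_sketch *h, unsigned int len)
{
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}
```

The release store in sim_kernel_assign pairs with the acquire load in
sketch_user_owns, so a reader that observes the new generation also observes
the published length; this mirrors the role of the acquire and release
operations in the example above.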
.Pp
In the current implementation, the kernel may assign zero, one, or both
buffers to the user process; however, an earlier implementation maintained
the invariant that at most one buffer could be assigned to the user process
at a time.
To ensure both progress and high performance, user processes should
acknowledge a completely processed buffer as quickly as possible, returning
it for reuse, and not block waiting on a second buffer while holding another
buffer.
.Sh IOCTLS
The
.Xr ioctl 2
command codes below are defined in
.In net/bpf.h .
All commands require
these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
.Pp
Additionally,
.Dv BIOCGETIF
and
.Dv BIOCSETIF
require
.In sys/socket.h
and
.In net/if.h .
.Pp
In addition to
.Dv FIONREAD
the following commands may be applied to any open
.Nm
file.
The (third) argument to
.Xr ioctl 2
should be a pointer to the type indicated.
.Bl -tag -width BIOCGETBUFMODE
.It Dv BIOCGBLEN
.Pq Li u_int
Returns the required buffer length for reads on
.Nm
files.
.It Dv BIOCSBLEN
.Pq Li u_int
Sets the buffer length for reads on
.Nm
files.
The buffer must be set before the file is attached to an interface
with
.Dv BIOCSETIF .
If the requested buffer size cannot be accommodated, the closest
allowable size will be set and returned in the argument.
A read call will result in
.Er EINVAL
if it is passed a buffer that is not this size.
.It Dv BIOCGDLT
.Pq Li u_int
Returns the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified.
The device types, prefixed with
.Dq Li DLT_ ,
are defined in
.In net/bpf.h .
.It Dv BIOCGDLTLIST
.Pq Li "struct bpf_dltlist"
Returns an array of the available types of the data link layer
underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
.Pp
The available types are returned in the array pointed to by the
.Va bfl_list
field while their length in u_int is supplied to the
.Va bfl_len
field.
.Er ENOMEM
is returned if there is not enough buffer space and
.Er EFAULT
is returned if a bad address is encountered.
The
.Va bfl_len
field is modified on return to indicate the actual length in u_int
of the array returned.
If
.Va bfl_list
is
.Dv NULL ,
the
.Va bfl_len
field is set to indicate the required length of an array in u_int.
.It Dv BIOCSDLT
.Pq Li u_int
Changes the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified or the specified
type is not available for the interface.
.It Dv BIOCPROMISC
Forces the interface into promiscuous mode.
All packets, not just those destined for the local host, are processed.
Since more than one file can be listening on a given interface,
a listener that opened its interface non-promiscuously may receive
packets promiscuously.
This problem can be remedied with an appropriate filter.
.Pp
The interface remains in promiscuous mode until all files listening
promiscuously are closed.
.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by BIOCGSTATS.
.It Dv BIOCGETIF
.Pq Li "struct ifreq"
Returns the name of the hardware interface that the file is listening on.
The name is returned in the ifr_name field of
the
.Li ifreq
structure.
All other fields are undefined.
.It Dv BIOCSETIF
.Pq Li "struct ifreq"
Sets the hardware interface associated with the file.
This
command must be performed before any packets can be read.
The device is indicated by name using the
.Li ifr_name
field of the
.Li ifreq
structure.
Additionally, performs the actions of
.Dv BIOCFLUSH .
.It Dv BIOCSRTIMEOUT
.It Dv BIOCGRTIMEOUT
.Pq Li "struct timeval"
Sets or gets the read timeout parameter.
The argument
specifies the length of time to wait before timing
out on a read request.
This parameter is initialized to zero by
.Xr open 2 ,
indicating no timeout.
.It Dv BIOCGSTATS
.Pq Li "struct bpf_stat"
Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
.Pp
The fields are:
.Bl -hang -offset indent
.It Li bs_recv
the number of packets received by the descriptor since opened or reset
(including any buffered since the last read call);
and
.It Li bs_drop
the number of packets which were accepted by the filter but dropped by the
kernel because of buffer overflows
(i.e., the application's reads are not keeping up with the packet traffic).
.El
.It Dv BIOCIMMEDIATE
.Pq Li u_int
Enables or disables
.Dq immediate mode ,
based on the truth value of the argument.
When immediate mode is enabled, reads return immediately upon packet
reception.
Otherwise, a read will block until either the kernel buffer
becomes full or a timeout occurs.
This is useful for programs like
.Xr rarpd 8
which must respond to messages in real time.
The default for a new file is off.
.It Dv BIOCSETF
.It Dv BIOCSETFNR
.Pq Li "struct bpf_program"
Sets the read filter program used by the kernel to discard uninteresting
packets.
An array of instructions and its length is passed in using
the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
.Pp
The filter program is pointed to by the
.Li bf_insns
field while its length in units of
.Sq Li struct bpf_insn
is given by the
.Li bf_len
field.
See section
.Sx "FILTER MACHINE"
for an explanation of the filter language.
The only difference between
.Dv BIOCSETF
and
.Dv BIOCSETFNR
is that
.Dv BIOCSETF
performs the actions of
.Dv BIOCFLUSH
while
.Dv BIOCSETFNR
does not.
.It Dv BIOCSETWF
.Pq Li "struct bpf_program"
Sets the write filter program used by the kernel to control what type of
packets can be written to the interface.
See the
.Dv BIOCSETF
command for more
information on the
.Nm
filter program.
.It Dv BIOCVERSION
.Pq Li "struct bpf_version"
Returns the major and minor version numbers of the filter language currently
recognized by the kernel.
Before installing a filter, applications must check
that the current version is compatible with the running kernel.
Version numbers are compatible if the major numbers match and the application minor
is less than or equal to the kernel minor.
The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
	u_short bv_major;
	u_short bv_minor;
};
.Ed
.Pp
The current version numbers are given by
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION
from
.In net/bpf.h .
An incompatible filter
may result in undefined behavior (most likely, an error returned by
.Fn ioctl
or haphazard packet matching).
.It Dv BIOCGRSIG
.It Dv BIOCSRSIG
.Pq Li u_int
Sets or gets the receive signal.
This signal will be sent to the process or process group specified by
.Dv FIOSETOWN .
It defaults to
.Dv SIGIO .
.It Dv BIOCSHDRCMPLT
.It Dv BIOCGHDRCMPLT
.Pq Li u_int
Sets or gets the status of the
.Dq header complete
flag.
Set to zero if the link level source address should be filled in automatically
by the interface output routine.
Set to one if the link level source
address will be written, as provided, to the wire.
This flag is initialized to zero by default.
.It Dv BIOCSSEESENT
.It Dv BIOCGSEESENT
.Pq Li u_int
These commands are obsolete but left for compatibility.
Use
.Dv BIOCSDIRECTION
and
.Dv BIOCGDIRECTION
instead.
Sets or gets the flag determining whether locally generated packets on the
interface should be returned by BPF.
Set to zero to see only incoming packets on the interface.
Set to one to see packets originating locally and remotely on the interface.
This flag is initialized to one by default.
.It Dv BIOCSDIRECTION
.It Dv BIOCGDIRECTION
.Pq Li u_int
Sets or gets the setting determining whether incoming, outgoing, or all packets
on the interface should be returned by BPF.
Set to
.Dv BPF_D_IN
to see only incoming packets on the interface.
Set to
.Dv BPF_D_INOUT
to see packets originating locally and remotely on the interface.
Set to
.Dv BPF_D_OUT
to see only outgoing packets on the interface.
This setting is initialized to
.Dv BPF_D_INOUT
by default.
.It Dv BIOCSTSTAMP
.It Dv BIOCGTSTAMP
.Pq Li u_int
Set or get format and resolution of the time stamps returned by BPF.
Set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
or
.Dv BPF_T_MICROTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timeval
format.
Set to
.Dv BPF_T_NANOTIME ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
or
.Dv BPF_T_NANOTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timespec
format.
Set to
.Dv BPF_T_BINTIME ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
or
.Dv BPF_T_BINTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct bintime
format.
Set to
.Dv BPF_T_NONE
to ignore the time stamp.
All 64-bit time stamp formats are wrapped in
.Vt struct bpf_ts .
The
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
are analogs of the corresponding formats without the _FAST suffix but do not
perform a full time counter query, so their accuracy is one timer tick.
The
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
.Dv BPF_T_BINTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
store the time elapsed since kernel boot.
This setting is initialized to
.Dv BPF_T_MICROTIME
by default.
.It Dv BIOCFEEDBACK
.Pq Li u_int
Set packet feedback mode.
This allows injected packets to be fed back as input to the interface when
output via the interface is successful.
When the
.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF to avoid
duplication.
This flag is initialized to zero by default.
.It Dv BIOCLOCK
Set the locked flag on the
.Nm
descriptor.
This prevents the execution of
ioctl commands which could change the underlying operating parameters of
the device.
.It Dv BIOCGETBUFMODE
.It Dv BIOCSETBUFMODE
.Pq Li u_int
Get or set the current
.Nm
buffering mode; possible values are
.Dv BPF_BUFMODE_BUFFER ,
buffered read mode, and
.Dv BPF_BUFMODE_ZBUF ,
zero-copy buffer mode.
.It Dv BIOCSETZBUF
.Pq Li "struct bpf_zbuf"
Set the current zero-copy buffer locations; buffer locations may be
set only once zero-copy buffer mode has been selected, and prior to attaching
to an interface.
Buffers must be of identical size, page-aligned, and an integer multiple of
pages in size.
The three fields
.Vt bz_bufa ,
.Vt bz_bufb ,
and
.Vt bz_buflen
must be filled out.
If buffers have already been set for this device, the ioctl will fail.
.It Dv BIOCGETZMAX
.Pq Li size_t
Get the largest individual zero-copy buffer size allowed.
As two buffers are used in zero-copy buffer mode, the limit (in practice) is
twice the returned size.
As zero-copy buffers consume kernel address space, conservative selection of
buffer size is suggested, especially when there are multiple
.Nm
descriptors in use on 32-bit systems.
.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
If no data is present, the buffer will remain owned by the kernel.
This allows consumers of zero-copy buffering to implement timeouts and
retrieve partially filled buffers.
In order to handle the case where no data is present in the buffer and
therefore ownership is not assigned, the user process must check
.Vt bzh_kernel_gen
against
.Vt bzh_user_gen .
.It Dv BIOCSETVLANPCP
Set the VLAN PCP bits to the supplied value.
.El
.Sh STANDARD IOCTLS
.Nm
now supports several standard
.Xr ioctl 2 Ns 's
which allow the user to do async and/or non-blocking I/O to an open
.Nm
file descriptor.
.Bl -tag -width SIOCGIFADDR
.It Dv FIONREAD
.Pq Li int
Returns the number of bytes that are immediately available for reading.
.It Dv SIOCGIFADDR
.Pq Li "struct ifreq"
Returns the address associated with the interface.
.It Dv FIONBIO
.Pq Li int
Sets or clears non-blocking I/O.
If arg is non-zero, then doing a
.Xr read 2
when no data is available will return -1 and
.Va errno
will be set to
.Er EAGAIN .
If arg is zero, non-blocking I/O is disabled.
Note: setting this overrides the timeout set by
.Dv BIOCSRTIMEOUT .
.It Dv FIOASYNC
.Pq Li int
Enables or disables async I/O.
When enabled (arg is non-zero), the process or process group specified by
.Dv FIOSETOWN
will start receiving
.Dv SIGIO 's
when packets arrive.
Note that you must do an
.Dv FIOSETOWN
in order for this to take effect,
as the system will not default this for you.
The signal may be changed via
.Dv BIOCSRSIG .
.It Dv FIOSETOWN
.It Dv FIOGETOWN
.Pq Li int
Sets or gets the process or process group (if negative) that should
receive
.Dv SIGIO
when packets are available.
The signal may be changed using
.Dv BIOCSRSIG
(see above).
.El
.Sh BPF HEADER
One of the following structures is prepended to each packet returned by
.Xr read 2
or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};
.Ed
.Pp
The fields, whose values are stored in host order, are:
.Pp
.Bl -tag -compact -width bh_datalen
.It Li bh_tstamp
The time at which the packet was processed by the packet filter.
.It Li bh_caplen
The length of the captured portion of the packet.
This is the minimum of
the truncation amount specified by the filter and the length of the packet.
.It Li bh_datalen
The length of the packet off the wire.
This value is independent of the truncation amount specified by the filter.
.It Li bh_hdrlen
The length of the
.Nm
header, which may not be equal to
.\" XXX - not really a function call
.Fn sizeof "struct bpf_xhdr"
or
.Fn sizeof "struct bpf_hdr" .
.El
.Pp
The
.Li bh_hdrlen
field exists to account for
padding between the header and the link level protocol.
The purpose here is to guarantee proper alignment of the packet
data structures, which is required on alignment sensitive
architectures and improves performance on many other architectures.
The packet filter ensures that the
.Vt bpf_xhdr ,
.Vt bpf_hdr
and the network layer
header will be word aligned.
Currently,
.Vt bpf_hdr
is used when the time stamp is set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
or
.Dv BPF_T_NONE
for backward compatibility reasons.
Otherwise,
.Vt bpf_xhdr
is used.
However,
.Vt bpf_hdr
may be deprecated in the near future.
Suitable precautions
must be taken when accessing the link layer protocol fields on alignment
restricted machines.
(This is not a problem on an Ethernet, since
the type field is a short falling on an even offset,
and the addresses are probably accessed in a bytewise fashion).
.Pp
Additionally, individual packets are padded so that each starts
on a word boundary.
This requires that an application
has some knowledge of how to get from packet to packet.
The macro
.Dv BPF_WORDALIGN
is defined in
.In net/bpf.h
to facilitate
this process.
It rounds up its argument to the nearest word aligned value (where a word is
.Dv BPF_ALIGNMENT
bytes wide).
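.Pp
The padding rule can be illustrated with a self-contained walk over a buffer
of records.
The sketch below does not use net/bpf.h: rec_hdr_sketch is a simplified,
hypothetical stand-in for the BPF header (the time stamp is omitted), and
SKETCH_WORDALIGN is a local macro written on the assumption that a word is
sizeof(long) bytes wide.
A real consumer would use the header structures and BPF_WORDALIGN from
net/bpf.h instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed word size and rounding rule for this sketch only. */
#define SKETCH_ALIGNMENT	sizeof(long)
#define SKETCH_WORDALIGN(x) \
	(((x) + (SKETCH_ALIGNMENT - 1)) & ~(SKETCH_ALIGNMENT - 1))

/* Simplified stand-in for the BPF header; time stamp omitted. */
struct rec_hdr_sketch {
	uint32_t caplen;	/* captured bytes following the header */
	uint32_t datalen;	/* original packet length */
	uint16_t hdrlen;	/* header length including padding */
};

/* Append one record at offset off; return the offset of the next one. */
static size_t
sketch_put(unsigned char *buf, size_t off, uint32_t caplen)
{
	struct rec_hdr_sketch h;

	memset(&h, 0, sizeof(h));
	h.caplen = caplen;
	h.datalen = caplen;
	h.hdrlen = (uint16_t)SKETCH_WORDALIGN(sizeof(h));
	memcpy(buf + off, &h, sizeof(h));
	memset(buf + off + h.hdrlen, 0xab, caplen);	/* dummy payload */
	return (off + SKETCH_WORDALIGN((size_t)h.hdrlen + caplen));
}

/* Count the records in buf[0..len) and sum their captured lengths. */
static size_t
sketch_walk(const unsigned char *buf, size_t len, size_t *total)
{
	struct rec_hdr_sketch h;
	size_t n = 0, off = 0, step;

	*total = 0;
	while (len - off >= sizeof(h)) {
		/* Copy the header out to avoid unaligned access. */
		memcpy(&h, buf + off, sizeof(h));
		if (h.hdrlen < sizeof(h) ||
		    (size_t)h.hdrlen + h.caplen > len - off)
			break;		/* malformed or truncated record */
		*total += h.caplen;
		n++;
		/* Header plus data, rounded up to the next word boundary. */
		step = SKETCH_WORDALIGN((size_t)h.hdrlen + h.caplen);
		if (step > len - off)
			break;		/* last record, no trailing padding */
		off += step;
	}
	return (n);
}
```

Bounding every step by the number of bytes actually available keeps the walk
inside the buffer even if a record header is damaged.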
.Pp
For example, if
.Sq Li p
points to the start of a packet, this expression
will advance it to the next packet:
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
.Pp
For the alignment mechanisms to work properly, the
buffer passed to
.Xr read 2
must itself be word aligned.
The
.Xr malloc 3
function
will always return an aligned buffer.
.Sh FILTER MACHINE
A filter program is an array of instructions, with all branches forwardly
directed, terminated by a
.Em return
instruction.
Each instruction performs some action on the pseudo-machine state,
which consists of an accumulator, index register, scratch memory store,
and implicit program counter.
.Pp
The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
.Pp
The
.Li k
field is used in different ways by different instructions,
and the
.Li jt
and
.Li jf
fields are used as offsets
by the branch instructions.
The opcodes are encoded in a semi-hierarchical fashion.
There are eight classes of instructions:
.Dv BPF_LD ,
.Dv BPF_LDX ,
.Dv BPF_ST ,
.Dv BPF_STX ,
.Dv BPF_ALU ,
.Dv BPF_JMP ,
.Dv BPF_RET ,
and
.Dv BPF_MISC .
Various other mode and
operator bits are or'd into the class to give the actual instructions.
The classes and modes are defined in
.In net/bpf.h .
.Pp
Below are the semantics for each defined
.Nm
instruction.
We use the convention that A is the accumulator, X is the index register,
P[] packet data, and M[] scratch memory store.
P[i:n] gives the data at byte offset
.Dq i
in the packet,
interpreted as a word (n=4),
unsigned halfword (n=2), or unsigned byte (n=1).
M[i] gives the i'th word in the scratch memory store, which is only
addressed in word units.
The memory store is indexed from 0 to
.Dv BPF_MEMWORDS
- 1.
.Li k ,
.Li jt ,
and
.Li jf
are the corresponding fields in the
instruction definition.
.Dq len
refers to the length of the packet.
.Bl -tag -width BPF_STXx
.It Dv BPF_LD
These instructions copy a value into the accumulator.
The type of the source operand is specified by an
.Dq addressing mode
and can be a constant
.Pq Dv BPF_IMM ,
packet data at a fixed offset
.Pq Dv BPF_ABS ,
packet data at a variable offset
.Pq Dv BPF_IND ,
the packet length
.Pq Dv BPF_LEN ,
or a word in the scratch memory store
.Pq Dv BPF_MEM .
For
.Dv BPF_IND
and
.Dv BPF_ABS ,
the data size must be specified as a word
.Pq Dv BPF_W ,
halfword
.Pq Dv BPF_H ,
or byte
.Pq Dv BPF_B .
The semantics of all the recognized
.Dv BPF_LD
instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
.It Dv BPF_LDX
These instructions load a value into the index register.
Note that
the addressing modes are more restrictive than those of the accumulator loads,
but they include
.Dv BPF_MSH ,
a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
.It Dv BPF_ST
This instruction stores the accumulator into the scratch memory.
We do not need an addressing mode since there is only one possibility
for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
.It Dv BPF_STX
This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX                 M[k] <- X
.Ed
.It Dv BPF_ALU
The ALU instructions perform operations between the accumulator and
index register or constant, and store the result back in the accumulator.
For binary operations, a source mode is required
.Dv ( BPF_K
or
.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K   A <- A + k
BPF_ALU+BPF_SUB+BPF_K   A <- A - k
BPF_ALU+BPF_MUL+BPF_K   A <- A * k
BPF_ALU+BPF_DIV+BPF_K   A <- A / k
BPF_ALU+BPF_MOD+BPF_K   A <- A % k
BPF_ALU+BPF_AND+BPF_K   A <- A & k
BPF_ALU+BPF_OR+BPF_K    A <- A | k
BPF_ALU+BPF_XOR+BPF_K   A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K   A <- A << k
BPF_ALU+BPF_RSH+BPF_K   A <- A >> k
BPF_ALU+BPF_ADD+BPF_X   A <- A + X
BPF_ALU+BPF_SUB+BPF_X   A <- A - X
BPF_ALU+BPF_MUL+BPF_X   A <- A * X
BPF_ALU+BPF_DIV+BPF_X   A <- A / X
BPF_ALU+BPF_MOD+BPF_X   A <- A % X
BPF_ALU+BPF_AND+BPF_X   A <- A & X
BPF_ALU+BPF_OR+BPF_X    A <- A | X
BPF_ALU+BPF_XOR+BPF_X   A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X   A <- A << X
BPF_ALU+BPF_RSH+BPF_X   A <- A >> X
BPF_ALU+BPF_NEG         A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32-bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA          pc += k
BPF_JMP+BPF_JGT+BPF_K   pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K   pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K   pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K  pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X   pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X   pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X   pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X  pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A           accept A bytes
BPF_RET+BPF_K           accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX        X <- A
BPF_MISC+BPF_TXA        A <- X
.Ed
.El
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem.
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to the write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_setfilter .
This removes any performance degradation for high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
         sizeof(struct ether_header)),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
This filter accepts only IP packets between hosts 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
    BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
    BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
    BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
    BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
    BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.
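.Pp
As a supplementary illustration, the Reverse ARP filter shown in
.Sx EXAMPLES
can be traced by hand with a small userland interpreter.
The sketch below is not the kernel implementation.
It handles only the three opcodes that filter uses, and every definition in it
is a local stand-in: the opcode values, the structure layout, and the
.Dv BPF_STMT
and
.Dv BPF_JUMP
macros are assumptions that should be checked against
.In net/bpf.h ,
while
.Fn run_filter
and
.Fn demo_verdict
are hypothetical helper names.
The accept length of 42 assumes a 14-byte Ethernet header followed by a
28-byte ARP body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins (assumed values); a real program includes <net/bpf.h>. */
struct bpf_insn {
    uint16_t code;      /* opcode */
    uint8_t  jt, jf;    /* true and false jump offsets */
    uint32_t k;         /* generic operand */
};
#define BPF_LD   0x00
#define BPF_JMP  0x05
#define BPF_RET  0x06
#define BPF_H    0x08
#define BPF_ABS  0x20
#define BPF_JEQ  0x10
#define BPF_K    0x00
#define BPF_STMT(code, k)         { (uint16_t)(code), 0, 0, (k) }
#define BPF_JUMP(code, k, jt, jf) { (uint16_t)(code), (jt), (jf), (k) }

#define ETHERTYPE_REVARP 0x8035
#define REVARP_REQUEST   3
#define ACCEPT_LEN       42  /* assumed: 14-byte Ethernet + 28-byte ARP */

/* The Reverse ARP filter, with the sizeof expression replaced by ACCEPT_LEN. */
static const struct bpf_insn rarp_filter[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, ACCEPT_LEN),
    BPF_STMT(BPF_RET+BPF_K, 0),
};

/* Run prog over pkt; returns the number of bytes to accept, 0 to reject. */
static uint32_t
run_filter(const struct bpf_insn *prog, size_t ninsns,
    const uint8_t *pkt, size_t len)
{
    uint32_t A = 0;     /* accumulator */
    size_t pc = 0;      /* index of the next instruction */

    while (pc < ninsns) {
        const struct bpf_insn *in = &prog[pc++];

        switch (in->code) {
        case BPF_LD + BPF_H + BPF_ABS:      /* A <- P[k:2] */
            if ((size_t)in->k + 2 > len)
                return 0;   /* this sketch rejects out-of-range loads */
            A = (uint32_t)pkt[in->k] << 8 | pkt[in->k + 1];
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:     /* pc += (A == k) ? jt : jf */
            pc += (A == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET + BPF_K:               /* accept k bytes */
            return in->k;
        default:
            return 0;   /* opcode not handled by this sketch */
        }
    }
    return 0;
}

/* Build a zeroed frame carrying the given EtherType and ARP opcode, then filter it. */
static uint32_t
demo_verdict(uint16_t ethertype, uint16_t arp_op)
{
    uint8_t pkt[ACCEPT_LEN] = { 0 };

    pkt[12] = (uint8_t)(ethertype >> 8);
    pkt[13] = (uint8_t)(ethertype & 0xff);
    pkt[20] = (uint8_t)(arp_op >> 8);
    pkt[21] = (uint8_t)(arp_op & 0xff);
    return run_filter(rarp_filter,
        sizeof(rarp_filter) / sizeof(rarp_filter[0]), pkt, sizeof(pkt));
}
```

With these assumptions, demo_verdict(0x8035, 3) returns 42, while a frame with
a different EtherType or ARP opcode falls through to the final instruction
and returns 0.
Jump offsets are applied after the program counter has already advanced,
which is why a false offset of 3 on the second instruction lands on the
last one.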