.\" Copyright (c) 2007 Seccuris Inc.
.\" All rights reserved.
.\"
.\" This software was developed by Robert N. M. Watson under contract to
.\" Seccuris Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that: (1) source code distributions
.\" retain the above copyright notice and this paragraph in its entirety, (2)
.\" distributions including binary code include the above copyright notice and
.\" this paragraph in its entirety in the documentation or other materials
.\" provided with the distribution, and (3) all advertising materials mentioning
.\" features or use of this software display the following acknowledgement:
.\" ``This product includes software developed by the University of California,
.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
.\" the University nor the names of its contributors may be used to endorse
.\" or promote products derived from this software without specific prior
.\" written permission.
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.Dd October 13, 2021
.Dt BPF 4
.Os
.Sh NAME
.Nm bpf
.Nd Berkeley Packet Filter
.Sh SYNOPSIS
.Cd device bpf
.Sh DESCRIPTION
The Berkeley Packet Filter
provides a raw interface to data link layers in a protocol
independent fashion.
All packets on the network, even those destined for other hosts,
are accessible through this mechanism.
.Pp
The packet filter appears as a character special device,
.Pa /dev/bpf .
After opening the device, the file descriptor must be bound to a
specific network interface with the
.Dv BIOCSETIF
ioctl.
A given interface can be shared by multiple listeners, and the filter
underlying each descriptor will see an identical packet stream.
.Pp
Associated with each open instance of a
.Nm
file is a user-settable packet filter.
Whenever a packet is received by an interface,
all file descriptors listening on that interface apply their filter.
Each descriptor that accepts the packet receives its own copy.
.Pp
A packet can be sent out on the network by writing to a
.Nm
file descriptor.
The writes are unbuffered, meaning only one packet can be processed per write.
Currently, only writes to Ethernets and
.Tn SLIP
links are supported.
.Sh BUFFER MODES
.Nm
devices deliver packet data to the application via memory buffers provided by
the application.
The buffer mode is set using the
.Dv BIOCSETBUFMODE
ioctl, and read using the
.Dv BIOCGETBUFMODE
ioctl.
.Ss Buffered read mode
By default,
.Nm
devices operate in the
.Dv BPF_BUFMODE_BUFFER
mode, in which packet data is copied explicitly from kernel to user memory
using the
.Xr read 2
system call.
The user process will declare a fixed buffer size that will be used both for
sizing internal buffers and for all
.Xr read 2
operations on the file.
This size is queried using the
.Dv BIOCGBLEN
ioctl, and is set using the
.Dv BIOCSBLEN
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
.Ss Zero-copy buffer mode
.Nm
devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
mode, in which packet data is written directly into two user memory buffers
by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
the page size.
The maximum zero-copy buffer size is returned by the
.Dv BIOCGETZMAX
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
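.Pp
The size and alignment constraints above can be sketched in portable C.
This is a minimal illustration under stated assumptions, not part of the
.Nm
interface: the helper names
.Fn zbuf_round_len
and
.Fn zbuf_alloc
are hypothetical, the
.Fa zmax
argument is assumed to hold the value obtained from
.Dv BIOCGETZMAX ,
and the C11
.Fn aligned_alloc
function stands in for whatever page-aligned allocator the application
prefers.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: clamp the requested length to zmax, then round
 * down to a whole number of pages.  Returns 0 if not even one page fits.
 */
static size_t
zbuf_round_len(size_t want, size_t zmax, size_t pagesz)
{
	size_t len = want < zmax ? want : zmax;

	return (len - (len % pagesz));
}

/*
 * Hypothetical helper: allocate one page-aligned buffer of len bytes
 * and zero it.  len must be a non-zero multiple of pagesz.
 */
static void *
zbuf_alloc(size_t len, size_t pagesz)
{
	void *p;

	if (len == 0 || (len % pagesz) != 0)
		return (NULL);
	p = aligned_alloc(pagesz, len);
	if (p != NULL)
		memset(p, 0, len);
	return (p);
}
```

A caller would typically pass the system page size as
.Fa pagesz
and allocate two buffers of the same rounded length, one for each of the
two user memory buffers.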
.Pp
The user process registers two memory buffers using the
.Dv BIOCSETZBUF
ioctl, which accepts a
.Vt struct bpf_zbuf
pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
.Pp
.Vt bz_bufa
is a pointer to the userspace address of the first buffer that will be
filled, and
.Vt bz_bufb
is a pointer to the second buffer.
.Nm
will then cycle between the two buffers as they fill and are acknowledged.
.Pp
Each buffer begins with a fixed-length header to hold synchronization and
data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int  bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int  bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int  bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
.Pp
The header structure of each buffer, including all padding, should be zeroed
before it is configured using
.Dv BIOCSETZBUF .
Remaining space in the buffer will be used by the kernel to store packet
data, laid out in the same format as with buffered read mode.
.Pp
The kernel and the user process follow a simple acknowledgement protocol via
the buffer header to synchronize access to the buffer: when the header
generation numbers,
.Vt bzh_kernel_gen
and
.Vt bzh_user_gen ,
hold the same value, the kernel owns the buffer, and when they differ,
userspace owns the buffer.
.Pp
While the kernel owns the buffer, the contents are unstable and may change
asynchronously; while the user process owns the buffer, its contents are
stable and will not be changed until the buffer has been acknowledged.
.Pp
Initializing the buffer headers to all 0's before registering the buffer has
the effect of assigning initial ownership of both buffers to the kernel.
The kernel signals that a buffer has been assigned to userspace by modifying
.Vt bzh_kernel_gen ,
and userspace acknowledges the buffer and returns it to the kernel by setting
the value of
.Vt bzh_user_gen
to the value of
.Vt bzh_kernel_gen .
.Pp
In order to avoid caching and memory re-ordering effects, the user process
must use atomic operations and memory barriers when checking for and
acknowledging buffers:
.Bd -literal
#include <machine/atomic.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
.Pp
The user process may force the assignment of the next buffer, if any data
is pending, to userspace using the
.Dv BIOCROTZBUF
ioctl.
This allows the user process to retrieve data in a partially filled buffer
before the buffer is full, such as following a timeout; the process must
recheck for buffer ownership using the header generation numbers, as the
buffer will not be assigned to userspace if no data was present.
.Pp
As in the buffered read mode,
.Xr kqueue 2 ,
.Xr poll 2 ,
and
.Xr select 2
may be used to sleep awaiting the availability of a completed buffer.
They will return a readable file descriptor when ownership of the next buffer
is assigned to user space.
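.Pp
The ownership hand-off described above can be modelled outside the kernel.
The following is a minimal sketch, assuming C11
.In stdatomic.h
in place of
.In machine/atomic.h ;
the structure
.Vt struct zbuf_header_sketch
and all three functions are hypothetical stand-ins, and the simulated kernel
side assumes that the generation number is advanced by incrementing it,
which the text above does not specify.

```c
#include <stdatomic.h>

/* Local stand-in for struct bpf_zbuf_header; padding is omitted. */
struct zbuf_header_sketch {
	_Atomic unsigned int kernel_gen;	/* kernel generation number */
	_Atomic unsigned int kernel_len;	/* length of data in the buffer */
	_Atomic unsigned int user_gen;		/* user generation number */
};

/* Userspace owns the buffer exactly when the generation numbers differ. */
static int
sketch_user_owns(struct zbuf_header_sketch *h)
{
	unsigned int ugen;

	ugen = atomic_load_explicit(&h->user_gen, memory_order_relaxed);
	return (ugen !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Acknowledge: copy kernel_gen into user_gen with release ordering. */
static void
sketch_acknowledge(struct zbuf_header_sketch *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, kgen, memory_order_release);
}

/*
 * Simulated kernel side (an assumption for illustration): record the data
 * length, then bump kernel_gen to hand the buffer to userspace.
 * Returns 0 on success, or -1 if userspace still owns the buffer.
 */
static int
sketch_kernel_publish(struct zbuf_header_sketch *h, unsigned int len)
{
	if (sketch_user_owns(h))
		return (-1);
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
	return (0);
}
```

A zeroed header starts out owned by the kernel; after one publish the buffer
belongs to userspace until it is acknowledged, and a second publish attempt
in between is refused.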
.Pp
In the current implementation, the kernel may assign zero, one, or both
buffers to the user process; however, an earlier implementation maintained
the invariant that at most one buffer could be assigned to the user process
at a time.
In order to both ensure progress and high performance, user processes should
acknowledge a completely processed buffer as quickly as possible, returning
it for reuse, and not block waiting on a second buffer while holding another
buffer.
.Sh IOCTLS
The
.Xr ioctl 2
command codes below are defined in
.In net/bpf.h .
All commands require
these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
.Pp
Additionally,
.Dv BIOCGETIF
and
.Dv BIOCSETIF
require
.In sys/socket.h
and
.In net/if.h .
.Pp
In addition to
.Dv FIONREAD
the following commands may be applied to any open
.Nm
file.
The (third) argument to
.Xr ioctl 2
should be a pointer to the type indicated.
.Bl -tag -width BIOCGETBUFMODE
.It Dv BIOCGBLEN
.Pq Li u_int
Returns the required buffer length for reads on
.Nm
files.
.It Dv BIOCSBLEN
.Pq Li u_int
Sets the buffer length for reads on
.Nm
files.
The buffer must be set before the file is attached to an interface
with
.Dv BIOCSETIF .
If the requested buffer size cannot be accommodated, the closest
allowable size will be set and returned in the argument.
A read call will result in
.Er EINVAL
if it is passed a buffer that is not this size.
.It Dv BIOCGDLT
.Pq Li u_int
Returns the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified.
The device types, prefixed with
.Dq Li DLT_ ,
are defined in
.In net/bpf.h .
.It Dv BIOCGDLTLIST
.Pq Li "struct bpf_dltlist"
Returns an array of the available types of the data link layer
underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
.Pp
The available types are returned in the array pointed to by the
.Va bfl_list
field while their length in u_int is supplied to the
.Va bfl_len
field.
.Er ENOMEM
is returned if there is not enough buffer space and
.Er EFAULT
is returned if a bad address is encountered.
The
.Va bfl_len
field is modified on return to indicate the actual length in u_int
of the array returned.
If
.Va bfl_list
is
.Dv NULL ,
the
.Va bfl_len
field is set to indicate the required length of an array in u_int.
.It Dv BIOCSDLT
.Pq Li u_int
Changes the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified or the specified
type is not available for the interface.
.It Dv BIOCPROMISC
Forces the interface into promiscuous mode.
All packets, not just those destined for the local host, are processed.
Since more than one file can be listening on a given interface,
a listener that opened its interface non-promiscuously may receive
packets promiscuously.
This problem can be remedied with an appropriate filter.
.Pp
The interface remains in promiscuous mode until all files listening
promiscuously are closed.
.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by BIOCGSTATS.
.It Dv BIOCGETIF
.Pq Li "struct ifreq"
Returns the name of the hardware interface that the file is listening on.
The name is returned in the ifr_name field of
the
.Li ifreq
structure.
All other fields are undefined.
.It Dv BIOCSETIF
.Pq Li "struct ifreq"
Sets the hardware interface associated with the file.
This
command must be performed before any packets can be read.
The device is indicated by name using the
.Li ifr_name
field of the
.Li ifreq
structure.
Additionally, performs the actions of
.Dv BIOCFLUSH .
.It Dv BIOCSRTIMEOUT
.It Dv BIOCGRTIMEOUT
.Pq Li "struct timeval"
Sets or gets the read timeout parameter.
The argument
specifies the length of time to wait before timing
out on a read request.
This parameter is initialized to zero by
.Xr open 2 ,
indicating no timeout.
.It Dv BIOCGSTATS
.Pq Li "struct bpf_stat"
Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
.Pp
The fields are:
.Bl -hang -offset indent
.It Li bs_recv
the number of packets received by the descriptor since opened or reset
(including any buffered since the last read call);
and
.It Li bs_drop
the number of packets which were accepted by the filter but dropped by the
kernel because of buffer overflows
(i.e., the application's reads are not keeping up with the packet traffic).
.El
.It Dv BIOCIMMEDIATE
.Pq Li u_int
Enables or disables
.Dq immediate mode ,
based on the truth value of the argument.
When immediate mode is enabled, reads return immediately upon packet
reception.
Otherwise, a read will block until either the kernel buffer
becomes full or a timeout occurs.
This is useful for programs like
.Xr rarpd 8
which must respond to messages in real time.
The default for a new file is off.
.It Dv BIOCSETF
.It Dv BIOCSETFNR
.Pq Li "struct bpf_program"
Sets the read filter program used by the kernel to discard uninteresting
packets.
An array of instructions and its length is passed in using
the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
.Pp
The filter program is pointed to by the
.Li bf_insns
field while its length in units of
.Sq Li struct bpf_insn
is given by the
.Li bf_len
field.
See section
.Sx "FILTER MACHINE"
for an explanation of the filter language.
The only difference between
.Dv BIOCSETF
and
.Dv BIOCSETFNR
is that
.Dv BIOCSETF
performs the actions of
.Dv BIOCFLUSH
while
.Dv BIOCSETFNR
does not.
.It Dv BIOCSETWF
.Pq Li "struct bpf_program"
Sets the write filter program used by the kernel to control what type of
packets can be written to the interface.
See the
.Dv BIOCSETF
command for more
information on the
.Nm
filter program.
.It Dv BIOCVERSION
.Pq Li "struct bpf_version"
Returns the major and minor version numbers of the filter language currently
recognized by the kernel.
Before installing a filter, applications must check
that the current version is compatible with the running kernel.
Version numbers are compatible if the major numbers match and the application minor
is less than or equal to the kernel minor.
The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
        u_short bv_major;
        u_short bv_minor;
};
.Ed
.Pp
The current version numbers are given by
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION
from
.In net/bpf.h .
An incompatible filter
may result in undefined behavior (most likely, an error returned by
.Fn ioctl
or haphazard packet matching).
.It Dv BIOCGRSIG
.It Dv BIOCSRSIG
.Pq Li u_int
Sets or gets the receive signal.
This signal will be sent to the process or process group specified by
.Dv FIOSETOWN .
It defaults to
.Dv SIGIO .
.It Dv BIOCSHDRCMPLT
.It Dv BIOCGHDRCMPLT
.Pq Li u_int
Sets or gets the status of the
.Dq header complete
flag.
Set to zero if the link level source address should be filled in automatically
by the interface output routine.
Set to one if the link level source
address will be written, as provided, to the wire.
This flag is initialized to zero by default.
.It Dv BIOCSSEESENT
.It Dv BIOCGSEESENT
.Pq Li u_int
These commands are obsolete but left for compatibility.
Use
.Dv BIOCSDIRECTION
and
.Dv BIOCGDIRECTION
instead.
Sets or gets the flag determining whether locally generated packets on the
interface should be returned by BPF.
Set to zero to see only incoming packets on the interface.
Set to one to see packets originating locally and remotely on the interface.
This flag is initialized to one by default.
.It Dv BIOCSDIRECTION
.It Dv BIOCGDIRECTION
.Pq Li u_int
Sets or gets the setting determining whether incoming, outgoing, or all packets
on the interface should be returned by BPF.
Set to
.Dv BPF_D_IN
to see only incoming packets on the interface.
Set to
.Dv BPF_D_INOUT
to see packets originating locally and remotely on the interface.
Set to
.Dv BPF_D_OUT
to see only outgoing packets on the interface.
This setting is initialized to
.Dv BPF_D_INOUT
by default.
.It Dv BIOCSTSTAMP
.It Dv BIOCGTSTAMP
.Pq Li u_int
Set or get format and resolution of the time stamps returned by BPF.
Set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
or
.Dv BPF_T_MICROTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timeval
format.
Set to
.Dv BPF_T_NANOTIME ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
or
.Dv BPF_T_NANOTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timespec
format.
Set to
.Dv BPF_T_BINTIME ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
or
.Dv BPF_T_BINTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct bintime
format.
Set to
.Dv BPF_T_NONE
to ignore the time stamp.
All 64-bit time stamp formats are wrapped in
.Vt struct bpf_ts .
The
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
are analogs of the corresponding formats without the _FAST suffix but do not
perform a full time counter query, so their accuracy is one timer tick.
The
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
.Dv BPF_T_BINTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
store the time elapsed since kernel boot.
This setting is initialized to
.Dv BPF_T_MICROTIME
by default.
.It Dv BIOCFEEDBACK
.Pq Li u_int
Set packet feedback mode.
This allows injected packets to be fed back as input to the interface when
output via the interface is successful.
When
.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF to avoid
duplication.
This flag is initialized to zero by default.
.It Dv BIOCLOCK
Set the locked flag on the
.Nm
descriptor.
This prevents the execution of
ioctl commands which could change the underlying operating parameters of
the device.
.It Dv BIOCGETBUFMODE
.It Dv BIOCSETBUFMODE
.Pq Li u_int
Get or set the current
.Nm
buffering mode; possible values are
.Dv BPF_BUFMODE_BUFFER ,
buffered read mode, and
.Dv BPF_BUFMODE_ZBUF ,
zero-copy buffer mode.
.It Dv BIOCSETZBUF
.Pq Li struct bpf_zbuf
Set the current zero-copy buffer locations; buffer locations may be
set only once zero-copy buffer mode has been selected, and prior to attaching
to an interface.
Buffers must be of identical size, page-aligned, and an integer multiple of
pages in size.
The three fields
.Vt bz_bufa ,
.Vt bz_bufb ,
and
.Vt bz_buflen
must be filled out.
If buffers have already been set for this device, the ioctl will fail.
.It Dv BIOCGETZMAX
.Pq Li size_t
Get the largest individual zero-copy buffer size allowed.
As two buffers are used in zero-copy buffer mode, the limit (in practice) is
twice the returned size.
As zero-copy buffers consume kernel address space, conservative selection of
buffer size is suggested, especially when there are multiple
.Nm
descriptors in use on 32-bit systems.
.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
If no data is present, the buffer will remain owned by the kernel.
This allows consumers of zero-copy buffering to implement timeouts and
retrieve partially filled buffers.
In order to handle the case where no data is present in the buffer and
therefore ownership is not assigned, the user process must check
.Vt bzh_kernel_gen
against
.Vt bzh_user_gen .
.It Dv BIOCSETVLANPCP
Set the VLAN PCP bits to the supplied value.
.El
.Sh STANDARD IOCTLS
.Nm
now supports several standard
.Xr ioctl 2 Ns 's
which allow the user to do async and/or non-blocking I/O to an open
.Nm
file descriptor.
.Bl -tag -width SIOCGIFADDR
.It Dv FIONREAD
.Pq Li int
Returns the number of bytes that are immediately available for reading.
.It Dv SIOCGIFADDR
.Pq Li "struct ifreq"
Returns the address associated with the interface.
.It Dv FIONBIO
.Pq Li int
Sets or clears non-blocking I/O.
If arg is non-zero, then doing a
.Xr read 2
when no data is available will return -1 and
.Va errno
will be set to
.Er EAGAIN .
If arg is zero, non-blocking I/O is disabled.
Note: setting this overrides the timeout set by
.Dv BIOCSRTIMEOUT .
.It Dv FIOASYNC
.Pq Li int
Enables or disables async I/O.
When enabled (arg is non-zero), the process or process group specified by
.Dv FIOSETOWN
will start receiving
.Dv SIGIO 's
when packets arrive.
Note that you must do an
.Dv FIOSETOWN
in order for this to take effect,
as the system will not default this for you.
The signal may be changed via
.Dv BIOCSRSIG .
.It Dv FIOSETOWN
.It Dv FIOGETOWN
.Pq Li int
Sets or gets the process or process group (if negative) that should
receive
.Dv SIGIO
when packets are available.
The signal may be changed using
.Dv BIOCSRSIG
(see above).
.El
.Sh BPF HEADER
One of the following structures is prepended to each packet returned by
.Xr read 2
or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;	/* time stamp */
	uint32_t	bh_caplen;	/* length of captured portion */
	uint32_t	bh_datalen;	/* original length of packet */
	u_short		bh_hdrlen;	/* length of bpf header (this struct
					   plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;	/* time stamp */
	uint32_t	bh_caplen;	/* length of captured portion */
	uint32_t	bh_datalen;	/* original length of packet */
	u_short		bh_hdrlen;	/* length of bpf header (this struct
					   plus alignment padding) */
};
.Ed
.Pp
The fields, whose values are stored in host order, are:
.Pp
.Bl -tag -compact -width bh_datalen
.It Li bh_tstamp
The time at which the packet was processed by the packet filter.
.It Li bh_caplen
The length of the captured portion of the packet.
This is the minimum of
the truncation amount specified by the filter and the length of the packet.
.It Li bh_datalen
The length of the packet off the wire.
This value is independent of the truncation amount specified by the filter.
.It Li bh_hdrlen
The length of the
.Nm
header, which may not be equal to
.\" XXX - not really a function call
.Fn sizeof "struct bpf_xhdr"
or
.Fn sizeof "struct bpf_hdr" .
.El
.Pp
The
.Li bh_hdrlen
field exists to account for
padding between the header and the link level protocol.
The purpose here is to guarantee proper alignment of the packet
data structures, which is required on alignment sensitive
architectures and improves performance on many other architectures.
The packet filter ensures that the
.Vt bpf_xhdr ,
.Vt bpf_hdr
and the network layer
header will be word aligned.
Currently,
.Vt bpf_hdr
is used when the time stamp is set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
or
.Dv BPF_T_NONE
for backward compatibility reasons.
Otherwise,
.Vt bpf_xhdr
is used.
However,
.Vt bpf_hdr
may be deprecated in the near future.
Suitable precautions
must be taken when accessing the link layer protocol fields on alignment
restricted machines.
(This is not a problem on an Ethernet, since
the type field is a short falling on an even offset,
and the addresses are probably accessed in a bytewise fashion).
.Pp
Additionally, individual packets are padded so that each starts
on a word boundary.
This requires that an application
has some knowledge of how to get from packet to packet.
The macro
.Dv BPF_WORDALIGN
is defined in
.In net/bpf.h
to facilitate
this process.
It rounds up its argument to the nearest word aligned value (where a word is
.Dv BPF_ALIGNMENT
bytes wide).
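.Pp
The rounding performed by
.Dv BPF_WORDALIGN
can be illustrated with a portable stand-in.
This is a minimal sketch rather than the definition from
.In net/bpf.h :
the functions
.Fn sk_wordalign
and
.Fn sk_next_offset
are hypothetical, and the word size is passed as a parameter (assumed to be
a power of two) instead of being taken from
.Dv BPF_ALIGNMENT .

```c
#include <stddef.h>

/* Round x up to the next multiple of align; align must be a power of two. */
static size_t
sk_wordalign(size_t x, size_t align)
{
	return ((x + (align - 1)) & ~(align - 1));
}

/*
 * Given a record starting at byte offset off, with a header of hdrlen
 * bytes followed by caplen captured bytes, return the offset at which
 * the next record starts.
 */
static size_t
sk_next_offset(size_t off, size_t hdrlen, size_t caplen, size_t align)
{
	return (off + sk_wordalign(hdrlen + caplen, align));
}
```

With a 4-byte word, a record with an 18-byte header and 60 captured bytes
occupies 78 bytes, so the following record starts 80 bytes later.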
.Pp
For example, if
.Sq Li p
points to the start of a packet, this expression
will advance it to the next packet:
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
.Pp
For the alignment mechanisms to work properly, the
buffer passed to
.Xr read 2
must itself be word aligned.
The
.Xr malloc 3
function
will always return an aligned buffer.
.Sh FILTER MACHINE
A filter program is an array of instructions, with all branches forwardly
directed, terminated by a
.Em return
instruction.
Each instruction performs some action on the pseudo-machine state,
which consists of an accumulator, index register, scratch memory store,
and implicit program counter.
.Pp
The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
.Pp
The
.Li k
field is used in different ways by different instructions,
and the
.Li jt
and
.Li jf
fields are used as offsets
by the branch instructions.
The opcodes are encoded in a semi-hierarchical fashion.
There are eight classes of instructions:
.Dv BPF_LD ,
.Dv BPF_LDX ,
.Dv BPF_ST ,
.Dv BPF_STX ,
.Dv BPF_ALU ,
.Dv BPF_JMP ,
.Dv BPF_RET ,
and
.Dv BPF_MISC .
Various other mode and
operator bits are or'd into the class to give the actual instructions.
The classes and modes are defined in
.In net/bpf.h .
.Pp
Below are the semantics for each defined
.Nm
instruction.
We use the convention that A is the accumulator, X is the index register,
P[] packet data, and M[] scratch memory store.
P[i:n] gives the data at byte offset
.Dq i
in the packet,
interpreted as a word (n=4),
unsigned halfword (n=2), or unsigned byte (n=1).
M[i] gives the i'th word in the scratch memory store, which is only
addressed in word units.
The memory store is indexed from 0 to
.Dv BPF_MEMWORDS
- 1.
.Li k ,
.Li jt ,
and
.Li jf
are the corresponding fields in the
instruction definition.
.Dq len
refers to the length of the packet.
.Bl -tag -width BPF_STXx
.It Dv BPF_LD
These instructions copy a value into the accumulator.
The type of the source operand is specified by an
.Dq addressing mode
and can be a constant
.Pq Dv BPF_IMM ,
packet data at a fixed offset
.Pq Dv BPF_ABS ,
packet data at a variable offset
.Pq Dv BPF_IND ,
the packet length
.Pq Dv BPF_LEN ,
or a word in the scratch memory store
.Pq Dv BPF_MEM .
For
.Dv BPF_IND
and
.Dv BPF_ABS ,
the data size must be specified as a word
.Pq Dv BPF_W ,
halfword
.Pq Dv BPF_H ,
or byte
.Pq Dv BPF_B .
The semantics of all the recognized
.Dv BPF_LD
instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
.It Dv BPF_LDX
These instructions load a value into the index register.
Note that
the addressing modes are more restrictive than those of the accumulator loads,
but they include
.Dv BPF_MSH ,
a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
.It Dv BPF_ST
This instruction stores the accumulator into the scratch memory.
We do not need an addressing mode since there is only one possibility
for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
.It Dv BPF_STX
This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX                 M[k] <- X
.Ed
.It Dv BPF_ALU
The ALU instructions perform operations between the accumulator and
index register or constant, and store the result back in the accumulator.
For binary operations, a source mode is required
.Dv ( BPF_K
or
.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K   A <- A + k
BPF_ALU+BPF_SUB+BPF_K   A <- A - k
BPF_ALU+BPF_MUL+BPF_K   A <- A * k
BPF_ALU+BPF_DIV+BPF_K   A <- A / k
BPF_ALU+BPF_MOD+BPF_K   A <- A % k
BPF_ALU+BPF_AND+BPF_K   A <- A & k
BPF_ALU+BPF_OR+BPF_K    A <- A | k
BPF_ALU+BPF_XOR+BPF_K   A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K   A <- A << k
BPF_ALU+BPF_RSH+BPF_K   A <- A >> k
BPF_ALU+BPF_ADD+BPF_X   A <- A + X
BPF_ALU+BPF_SUB+BPF_X   A <- A - X
BPF_ALU+BPF_MUL+BPF_X   A <- A * X
BPF_ALU+BPF_DIV+BPF_X   A <- A / X
BPF_ALU+BPF_MOD+BPF_X   A <- A % X
BPF_ALU+BPF_AND+BPF_X   A <- A & X
BPF_ALU+BPF_OR+BPF_X    A <- A | X
BPF_ALU+BPF_XOR+BPF_X   A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X   A <- A << X
BPF_ALU+BPF_RSH+BPF_X   A <- A >> X
BPF_ALU+BPF_NEG         A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32-bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA          pc += k
BPF_JMP+BPF_JGT+BPF_K   pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K   pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K   pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K  pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X   pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X   pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X   pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X  pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A           accept A bytes
BPF_RET+BPF_K           accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX        X <- A
BPF_MISC+BPF_TXA        A <- X
.Ed
.El
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to the write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_setfilter .
This removes any performance degradation for high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARPOP_REVREQUEST, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
        sizeof(struct ether_header)),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
This filter accepts only IP packets between hosts 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
    BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
    BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
    BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
    BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
    BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
    BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
    BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.