.\" Copyright (c) 2007 Seccuris Inc.
.\" All rights reserved.
.\"
.\" This software was developed by Robert N. M. Watson under contract to
.\" Seccuris Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that: (1) source code distributions
.\" retain the above copyright notice and this paragraph in its entirety, (2)
.\" distributions including binary code include the above copyright notice and
.\" this paragraph in its entirety in the documentation or other materials
.\" provided with the distribution, and (3) all advertising materials mentioning
.\" features or use of this software display the following acknowledgement:
.\" ``This product includes software developed by the University of California,
.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
.\" the University nor the names of its contributors may be used to endorse
.\" or promote products derived from this software without specific prior
.\" written permission.
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 9, 2020
.Dt BPF 4
.Os
.Sh NAME
.Nm bpf
.Nd Berkeley Packet Filter
.Sh SYNOPSIS
.Cd device bpf
.Sh DESCRIPTION
The Berkeley Packet Filter
provides a raw interface to data link layers in a protocol
independent fashion.
All packets on the network, even those destined for other hosts,
are accessible through this mechanism.
.Pp
The packet filter appears as a character special device,
.Pa /dev/bpf .
After opening the device, the file descriptor must be bound to a
specific network interface with the
.Dv BIOCSETIF
ioctl.
A given interface can be shared by multiple listeners, and the filter
underlying each descriptor will see an identical packet stream.
.Pp
Associated with each open instance of a
.Nm
file is a user-settable packet filter.
Whenever a packet is received by an interface,
all file descriptors listening on that interface apply their filter.
Each descriptor that accepts the packet receives its own copy.
.Pp
A packet can be sent out on the network by writing to a
.Nm
file descriptor.
The writes are unbuffered, meaning only one packet can be processed per write.
Currently, only writes to Ethernets and
.Tn SLIP
links are supported.
.Sh BUFFER MODES
.Nm
devices deliver packet data to the application via memory buffers provided by
the application.
The buffer mode is set using the
.Dv BIOCSETBUFMODE
ioctl, and read using the
.Dv BIOCGETBUFMODE
ioctl.
.Ss Buffered read mode
By default,
.Nm
devices operate in the
.Dv BPF_BUFMODE_BUFFER
mode, in which packet data is copied explicitly from kernel to user memory
using the
.Xr read 2
system call.
The user process will declare a fixed buffer size that will be used both for
sizing internal buffers and for all
.Xr read 2
operations on the file.
This size is queried using the
.Dv BIOCGBLEN
ioctl, and is set using the
.Dv BIOCSBLEN
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
.Ss Zero-copy buffer mode
.Nm
devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
mode, in which packet data is written directly into two user memory buffers
by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
the page size.
The maximum zero-copy buffer size is returned by the
.Dv BIOCGETZMAX
ioctl.
Note that an individual packet larger than the buffer size is necessarily
truncated.
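.Pp
The size constraints on zero-copy buffers can be illustrated with a small
piece of arithmetic.
The following is only a sketch: the helper name, its parameters, and the
choice to round down are assumptions made for illustration and are not part
of the
.Nm
interface.
The caller is assumed to supply the system page size and the limit obtained
from
.Dv BIOCGETZMAX .
.Bd -literal
```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of <net/bpf.h>: choose a zero-copy
 * buffer length that is a whole number of pages and no larger than
 * zmax.  "page" stands in for the system page size and "zmax" for the
 * limit reported by BIOCGETZMAX; both are supplied by the caller.
 * Returns 0 if not even a single page fits under zmax.
 */
static size_t
demo_zbuf_len(size_t want, size_t page, size_t zmax)
{
	size_t len;

	if (page == 0 || zmax < page)
		return (0);
	len = want < zmax ? want : zmax;	/* clamp to the limit */
	len -= len % page;			/* round down to a page multiple */
	if (len == 0)
		len = page;			/* never smaller than one page */
	return (len);
}
```
.Ed
The same length would then be used for both buffers, since they must be of
equal size.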
.Pp
The user process registers two memory buffers using the
.Dv BIOCSETZBUF
ioctl, which accepts a
.Vt struct bpf_zbuf
pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
.Pp
.Vt bz_bufa
is a pointer to the userspace address of the first buffer that will be
filled, and
.Vt bz_bufb
is a pointer to the second buffer.
.Nm
will then cycle between the two buffers as they fill and are acknowledged.
.Pp
Each buffer begins with a fixed-length header to hold synchronization and
data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int  bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int  bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int  bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
.Pp
The header structure of each buffer, including all padding, should be zeroed
before it is configured using
.Dv BIOCSETZBUF .
Remaining space in the buffer will be used by the kernel to store packet
data, laid out in the same format as with buffered read mode.
.Pp
The kernel and the user process follow a simple acknowledgement protocol via
the buffer header to synchronize access to the buffer: when the header
generation numbers,
.Vt bzh_kernel_gen
and
.Vt bzh_user_gen ,
hold the same value, the kernel owns the buffer, and when they differ,
userspace owns the buffer.
.Pp
While the kernel owns the buffer, the contents are unstable and may change
asynchronously; while the user process owns the buffer, its contents are
stable and will not be changed until the buffer has been acknowledged.
.Pp
Initializing the buffer headers to all 0's before registering the buffer has
the effect of assigning initial ownership of both buffers to the kernel.
The kernel signals that a buffer has been assigned to userspace by modifying
.Vt bzh_kernel_gen ,
and userspace acknowledges the buffer and returns it to the kernel by setting
the value of
.Vt bzh_user_gen
to the value of
.Vt bzh_kernel_gen .
.Pp
In order to avoid caching and memory re-ordering effects, the user process
must use atomic operations and memory barriers when checking for and
acknowledging buffers:
.Bd -literal
#include <machine/atomic.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
.Pp
The user process may force the assignment of the next buffer, if any data
is pending, to userspace using the
.Dv BIOCROTZBUF
ioctl.
This allows the user process to retrieve data in a partially filled buffer
before the buffer is full, such as following a timeout; the process must
recheck for buffer ownership using the header generation numbers, as the
buffer will not be assigned to userspace if no data was present.
.Pp
As in the buffered read mode,
.Xr kqueue 2 ,
.Xr poll 2 ,
and
.Xr select 2
may be used to sleep awaiting the availability of a completed buffer.
They will return a readable file descriptor when ownership of the next buffer
is assigned to user space.
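.Pp
The generation-number handshake described in this section can be modeled
outside the kernel to see how ownership moves.
The following is a sketch only: it uses C11
.In stdatomic.h
in place of
.In machine/atomic.h ,
and the structure and every function name in it are hypothetical stand-ins,
not the real
.Vt struct bpf_zbuf_header
or any
.Nm
API.
The kernel side is simulated by an ordinary function.
.Bd -literal
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the header; NOT the real struct bpf_zbuf_header. */
struct demo_zbuf_header {
	_Atomic unsigned int kernel_gen;	/* bumped by the "kernel" */
	_Atomic unsigned int kernel_len;	/* bytes of data published */
	_Atomic unsigned int user_gen;		/* set by the user on ack */
};

/* Userspace owns the buffer when the two generations differ. */
static bool
demo_user_owns(struct demo_zbuf_header *h)
{
	unsigned int ugen;

	ugen = atomic_load_explicit(&h->user_gen, memory_order_relaxed);
	return (ugen !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Acknowledge: copy kernel_gen into user_gen with release ordering. */
static void
demo_acknowledge(struct demo_zbuf_header *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, kgen, memory_order_release);
}

/* Simulated kernel side: publish a length, then hand the buffer over. */
static void
demo_kernel_assign(struct demo_zbuf_header *h, unsigned int len)
{
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}

/* Walk one hand-off cycle; returns 1 if ownership moved as described. */
static int
demo_roundtrip(void)
{
	struct demo_zbuf_header h;

	atomic_init(&h.kernel_gen, 0);
	atomic_init(&h.kernel_len, 0);
	atomic_init(&h.user_gen, 0);

	if (demo_user_owns(&h))		/* zeroed header: kernel owns */
		return (0);
	demo_kernel_assign(&h, 128);
	if (!demo_user_owns(&h))	/* generations differ: user owns */
		return (0);
	demo_acknowledge(&h);
	return (!demo_user_owns(&h));	/* equal again: kernel owns */
}
```
.Ed
In this model a zeroed header starts out owned by the kernel, which matches
the initialization rule given above.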
.Pp
In the current implementation, the kernel may assign zero, one, or both
buffers to the user process; however, an earlier implementation maintained
the invariant that at most one buffer could be assigned to the user process
at a time.
In order to both ensure progress and high performance, user processes should
acknowledge a completely processed buffer as quickly as possible, returning
it for reuse, and not block waiting on a second buffer while holding another
buffer.
.Sh IOCTLS
The
.Xr ioctl 2
command codes below are defined in
.In net/bpf.h .
All commands require
these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
.Pp
Additionally,
.Dv BIOCGETIF
and
.Dv BIOCSETIF
require
.In sys/socket.h
and
.In net/if.h .
.Pp
In addition to
.Dv FIONREAD
the following commands may be applied to any open
.Nm
file.
The (third) argument to
.Xr ioctl 2
should be a pointer to the type indicated.
.Bl -tag -width BIOCGETBUFMODE
.It Dv BIOCGBLEN
.Pq Li u_int
Returns the required buffer length for reads on
.Nm
files.
.It Dv BIOCSBLEN
.Pq Li u_int
Sets the buffer length for reads on
.Nm
files.
The buffer must be set before the file is attached to an interface
with
.Dv BIOCSETIF .
If the requested buffer size cannot be accommodated, the closest
allowable size will be set and returned in the argument.
A read call will result in
.Er EINVAL
if it is passed a buffer that is not this size.
.It Dv BIOCGDLT
.Pq Li u_int
Returns the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified.
The device types, prefixed with
.Dq Li DLT_ ,
are defined in
.In net/bpf.h .
.It Dv BIOCGDLTLIST
.Pq Li "struct bpf_dltlist"
Returns an array of the available types of the data link layer
underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
.Pp
The available types are returned in the array pointed to by the
.Va bfl_list
field while their length in u_int is supplied to the
.Va bfl_len
field.
.Er ENOMEM
is returned if there is not enough buffer space and
.Er EFAULT
is returned if a bad address is encountered.
The
.Va bfl_len
field is modified on return to indicate the actual length in u_int
of the array returned.
If
.Va bfl_list
is
.Dv NULL ,
the
.Va bfl_len
field is set to indicate the required length of an array in u_int.
.It Dv BIOCSDLT
.Pq Li u_int
Changes the type of the data link layer underlying the attached interface.
.Er EINVAL
is returned if no interface has been specified or the specified
type is not available for the interface.
.It Dv BIOCPROMISC
Forces the interface into promiscuous mode.
All packets, not just those destined for the local host, are processed.
Since more than one file can be listening on a given interface,
a listener that opened its interface non-promiscuously may receive
packets promiscuously.
This problem can be remedied with an appropriate filter.
.Pp
The interface remains in promiscuous mode until all files listening
promiscuously are closed.
.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by BIOCGSTATS.
.It Dv BIOCGETIF
.Pq Li "struct ifreq"
Returns the name of the hardware interface that the file is listening on.
The name is returned in the ifr_name field of
the
.Li ifreq
structure.
All other fields are undefined.
.It Dv BIOCSETIF
.Pq Li "struct ifreq"
Sets the hardware interface associated with the file.
This
command must be performed before any packets can be read.
The device is indicated by name using the
.Li ifr_name
field of the
.Li ifreq
structure.
Additionally, performs the actions of
.Dv BIOCFLUSH .
.It Dv BIOCSRTIMEOUT
.It Dv BIOCGRTIMEOUT
.Pq Li "struct timeval"
Sets or gets the read timeout parameter.
The argument
specifies the length of time to wait before timing
out on a read request.
This parameter is initialized to zero by
.Xr open 2 ,
indicating no timeout.
.It Dv BIOCGSTATS
.Pq Li "struct bpf_stat"
Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
.Pp
The fields are:
.Bl -hang -offset indent
.It Li bs_recv
the number of packets received by the descriptor since opened or reset
(including any buffered since the last read call);
and
.It Li bs_drop
the number of packets which were accepted by the filter but dropped by the
kernel because of buffer overflows
(i.e., the application's reads are not keeping up with the packet traffic).
.El
.It Dv BIOCIMMEDIATE
.Pq Li u_int
Enables or disables
.Dq immediate mode ,
based on the truth value of the argument.
When immediate mode is enabled, reads return immediately upon packet
reception.
Otherwise, a read will block until either the kernel buffer
becomes full or a timeout occurs.
This is useful for programs like
.Xr rarpd 8
which must respond to messages in real time.
The default for a new file is off.
.It Dv BIOCSETF
.It Dv BIOCSETFNR
.Pq Li "struct bpf_program"
Sets the read filter program used by the kernel to discard uninteresting
packets.
An array of instructions and its length is passed in using
the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
.Pp
The filter program is pointed to by the
.Li bf_insns
field while its length in units of
.Sq Li struct bpf_insn
is given by the
.Li bf_len
field.
See section
.Sx "FILTER MACHINE"
for an explanation of the filter language.
The only difference between
.Dv BIOCSETF
and
.Dv BIOCSETFNR
is
.Dv BIOCSETF
performs the actions of
.Dv BIOCFLUSH
while
.Dv BIOCSETFNR
does not.
.It Dv BIOCSETWF
.Pq Li "struct bpf_program"
Sets the write filter program used by the kernel to control what type of
packets can be written to the interface.
See the
.Dv BIOCSETF
command for more
information on the
.Nm
filter program.
.It Dv BIOCVERSION
.Pq Li "struct bpf_version"
Returns the major and minor version numbers of the filter language currently
recognized by the kernel.
Before installing a filter, applications must check
that the current version is compatible with the running kernel.
Version numbers are compatible if the major numbers match and the application minor
is less than or equal to the kernel minor.
The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
	u_short bv_major;
	u_short bv_minor;
};
.Ed
.Pp
The current version numbers are given by
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION
from
.In net/bpf.h .
An incompatible filter
may result in undefined behavior (most likely, an error returned by
.Fn ioctl
or haphazard packet matching).
.It Dv BIOCGRSIG
.It Dv BIOCSRSIG
.Pq Li u_int
Sets or gets the receive signal.
This signal will be sent to the process or process group specified by
.Dv FIOSETOWN .
It defaults to
.Dv SIGIO .
.It Dv BIOCSHDRCMPLT
.It Dv BIOCGHDRCMPLT
.Pq Li u_int
Sets or gets the status of the
.Dq header complete
flag.
Set to zero if the link level source address should be filled in automatically
by the interface output routine.
Set to one if the link level source
address will be written, as provided, to the wire.
This flag is initialized to zero by default.
.It Dv BIOCSSEESENT
.It Dv BIOCGSEESENT
.Pq Li u_int
These commands are obsolete but left for compatibility.
Use
.Dv BIOCSDIRECTION
and
.Dv BIOCGDIRECTION
instead.
Sets or gets the flag determining whether locally generated packets on the
interface should be returned by BPF.
Set to zero to see only incoming packets on the interface.
Set to one to see packets originating locally and remotely on the interface.
This flag is initialized to one by default.
.It Dv BIOCSDIRECTION
.It Dv BIOCGDIRECTION
.Pq Li u_int
Sets or gets the setting determining whether incoming, outgoing, or all packets
on the interface should be returned by BPF.
Set to
.Dv BPF_D_IN
to see only incoming packets on the interface.
Set to
.Dv BPF_D_INOUT
to see packets originating locally and remotely on the interface.
Set to
.Dv BPF_D_OUT
to see only outgoing packets on the interface.
This setting is initialized to
.Dv BPF_D_INOUT
by default.
.It Dv BIOCSTSTAMP
.It Dv BIOCGTSTAMP
.Pq Li u_int
Set or get format and resolution of the time stamps returned by BPF.
Set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
or
.Dv BPF_T_MICROTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timeval
format.
Set to
.Dv BPF_T_NANOTIME ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
or
.Dv BPF_T_NANOTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct timespec
format.
Set to
.Dv BPF_T_BINTIME ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
or
.Dv BPF_T_BINTIME_MONOTONIC_FAST
to get time stamps in 64-bit
.Vt struct bintime
format.
Set to
.Dv BPF_T_NONE
to ignore time stamp.
All 64-bit time stamp formats are wrapped in
.Vt struct bpf_ts .
The
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_NANOTIME_FAST ,
.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
are analogs of corresponding formats without _FAST suffix but do not perform
a full time counter query, so their accuracy is one timer tick.
The
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_NANOTIME_MONOTONIC ,
.Dv BPF_T_BINTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
and
.Dv BPF_T_BINTIME_MONOTONIC_FAST
store the time elapsed since kernel boot.
This setting is initialized to
.Dv BPF_T_MICROTIME
by default.
.It Dv BIOCFEEDBACK
.Pq Li u_int
Set packet feedback mode.
This allows injected packets to be fed back as input to the interface when
output via the interface is successful.
When
.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF to avoid
duplication.
This flag is initialized to zero by default.
.It Dv BIOCLOCK
Set the locked flag on the
.Nm
descriptor.
This prevents the execution of
ioctl commands which could change the underlying operating parameters of
the device.
.It Dv BIOCGETBUFMODE
.It Dv BIOCSETBUFMODE
.Pq Li u_int
Get or set the current
.Nm
buffering mode; possible values are
.Dv BPF_BUFMODE_BUFFER ,
buffered read mode, and
.Dv BPF_BUFMODE_ZBUF ,
zero-copy buffer mode.
.It Dv BIOCSETZBUF
.Pq Li struct bpf_zbuf
Set the current zero-copy buffer locations; buffer locations may be
set only once zero-copy buffer mode has been selected, and prior to attaching
to an interface.
Buffers must be of identical size, page-aligned, and an integer multiple of
pages in size.
The three fields
.Vt bz_bufa ,
.Vt bz_bufb ,
and
.Vt bz_buflen
must be filled out.
If buffers have already been set for this device, the ioctl will fail.
.It Dv BIOCGETZMAX
.Pq Li size_t
Get the largest individual zero-copy buffer size allowed.
As two buffers are used in zero-copy buffer mode, the limit (in practice) is
twice the returned size.
As zero-copy buffers consume kernel address space, conservative selection of
buffer size is suggested, especially when there are multiple
.Nm
descriptors in use on 32-bit systems.
.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
If no data is present, the buffer will remain owned by the kernel.
This allows consumers of zero-copy buffering to implement timeouts and
retrieve partially filled buffers.
In order to handle the case where no data is present in the buffer and
therefore ownership is not assigned, the user process must check
.Vt bzh_kernel_gen
against
.Vt bzh_user_gen .
.El
.Sh STANDARD IOCTLS
.Nm
now supports several standard
.Xr ioctl 2 Ns 's
which allow the user to do async and/or non-blocking I/O to an open
.Nm
file descriptor.
.Bl -tag -width SIOCGIFADDR
.It Dv FIONREAD
.Pq Li int
Returns the number of bytes that are immediately available for reading.
.It Dv SIOCGIFADDR
.Pq Li "struct ifreq"
Returns the address associated with the interface.
.It Dv FIONBIO
.Pq Li int
Sets or clears non-blocking I/O.
If arg is non-zero, then doing a
.Xr read 2
when no data is available will return -1 and
.Va errno
will be set to
.Er EAGAIN .
If arg is zero, non-blocking I/O is disabled.
Note: setting this overrides the timeout set by
.Dv BIOCSRTIMEOUT .
.It Dv FIOASYNC
.Pq Li int
Enables or disables async I/O.
When enabled (arg is non-zero), the process or process group specified by
.Dv FIOSETOWN
will start receiving
.Dv SIGIO 's
when packets arrive.
Note that you must do an
.Dv FIOSETOWN
in order for this to take effect,
as the system will not default this for you.
The signal may be changed via
.Dv BIOCSRSIG .
.It Dv FIOSETOWN
.It Dv FIOGETOWN
.Pq Li int
Sets or gets the process or process group (if negative) that should
receive
.Dv SIGIO
when packets are available.
The signal may be changed using
.Dv BIOCSRSIG
(see above).
.El
.Sh BPF HEADER
One of the following structures is prepended to each packet returned by
.Xr read 2
or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};
.Ed
.Pp
The fields, whose values are stored in host order, are:
.Pp
.Bl -tag -compact -width bh_datalen
.It Li bh_tstamp
The time at which the packet was processed by the packet filter.
.It Li bh_caplen
The length of the captured portion of the packet.
This is the minimum of
the truncation amount specified by the filter and the length of the packet.
.It Li bh_datalen
The length of the packet off the wire.
This value is independent of the truncation amount specified by the filter.
.It Li bh_hdrlen
The length of the
.Nm
header, which may not be equal to
.\" XXX - not really a function call
.Fn sizeof "struct bpf_xhdr"
or
.Fn sizeof "struct bpf_hdr" .
.El
.Pp
The
.Li bh_hdrlen
field exists to account for
padding between the header and the link level protocol.
The purpose here is to guarantee proper alignment of the packet
data structures, which is required on alignment sensitive
architectures and improves performance on many other architectures.
The packet filter ensures that the
.Vt bpf_xhdr ,
.Vt bpf_hdr
and the network layer
header will be word aligned.
Currently,
.Vt bpf_hdr
is used when the time stamp is set to
.Dv BPF_T_MICROTIME ,
.Dv BPF_T_MICROTIME_FAST ,
.Dv BPF_T_MICROTIME_MONOTONIC ,
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
or
.Dv BPF_T_NONE
for backward compatibility reasons.
Otherwise,
.Vt bpf_xhdr
is used.
However,
.Vt bpf_hdr
may be deprecated in the near future.
Suitable precautions
must be taken when accessing the link layer protocol fields on alignment
restricted machines.
(This is not a problem on an Ethernet, since
the type field is a short falling on an even offset,
and the addresses are probably accessed in a bytewise fashion).
.Pp
Additionally, individual packets are padded so that each starts
on a word boundary.
This requires that an application
has some knowledge of how to get from packet to packet.
The macro
.Dv BPF_WORDALIGN
is defined in
.In net/bpf.h
to facilitate
this process.
It rounds up its argument to the nearest word aligned value (where a word is
.Dv BPF_ALIGNMENT
bytes wide).
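.Pp
The rounding just described can be worked through with plain arithmetic.
The following is a sketch under stated assumptions: the alignment of 8 bytes
is a hypothetical stand-in for
.Dv BPF_ALIGNMENT ,
and both function names are invented for illustration; real code should use
.Dv BPF_WORDALIGN
from
.In net/bpf.h .
.Bd -literal
```c
#include <stddef.h>

/* Hypothetical stand-in for BPF_ALIGNMENT; must be a power of two. */
#define DEMO_ALIGNMENT	((size_t)8)

/* Round x up to the next multiple of DEMO_ALIGNMENT. */
static size_t
demo_wordalign(size_t x)
{

	return ((x + (DEMO_ALIGNMENT - 1)) & ~(DEMO_ALIGNMENT - 1));
}

/*
 * Given the buffer offset of one record and its header and captured
 * lengths, return the offset at which the next record starts.
 */
static size_t
demo_next_offset(size_t off, size_t hdrlen, size_t caplen)
{

	return (off + demo_wordalign(hdrlen + caplen));
}
```
.Ed
With these assumptions, a record at offset 0 with an 18-byte header and 60
captured bytes occupies 78 bytes, so the next record starts at offset 80.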
.Pp
For example, if
.Sq Li p
points to the start of a packet, this expression
will advance it to the next packet:
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
.Pp
For the alignment mechanisms to work properly, the
buffer passed to
.Xr read 2
must itself be word aligned.
The
.Xr malloc 3
function
will always return an aligned buffer.
.Sh FILTER MACHINE
A filter program is an array of instructions, with all branches forwardly
directed, terminated by a
.Em return
instruction.
Each instruction performs some action on the pseudo-machine state,
which consists of an accumulator, index register, scratch memory store,
and implicit program counter.
.Pp
The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
.Pp
The
.Li k
field is used in different ways by different instructions,
and the
.Li jt
and
.Li jf
fields are used as offsets
by the branch instructions.
The opcodes are encoded in a semi-hierarchical fashion.
There are eight classes of instructions:
.Dv BPF_LD ,
.Dv BPF_LDX ,
.Dv BPF_ST ,
.Dv BPF_STX ,
.Dv BPF_ALU ,
.Dv BPF_JMP ,
.Dv BPF_RET ,
and
.Dv BPF_MISC .
Various other mode and
operator bits are or'd into the class to give the actual instructions.
The classes and modes are defined in
.In net/bpf.h .
.Pp
Below are the semantics for each defined
.Nm
instruction.
We use the convention that A is the accumulator, X is the index register,
P[] packet data, and M[] scratch memory store.
P[i:n] gives the data at byte offset
.Dq i
in the packet,
interpreted as a word (n=4),
unsigned halfword (n=2), or unsigned byte (n=1).
M[i] gives the i'th word in the scratch memory store, which is only
addressed in word units.
The memory store is indexed from 0 to
.Dv BPF_MEMWORDS
- 1.
.Li k ,
.Li jt ,
and
.Li jf
are the corresponding fields in the
instruction definition.
.Dq len
refers to the length of the packet.
.Bl -tag -width BPF_STXx
.It Dv BPF_LD
These instructions copy a value into the accumulator.
The type of the source operand is specified by an
.Dq addressing mode
and can be a constant
.Pq Dv BPF_IMM ,
packet data at a fixed offset
.Pq Dv BPF_ABS ,
packet data at a variable offset
.Pq Dv BPF_IND ,
the packet length
.Pq Dv BPF_LEN ,
or a word in the scratch memory store
.Pq Dv BPF_MEM .
For
.Dv BPF_IND
and
.Dv BPF_ABS ,
the data size must be specified as a word
.Pq Dv BPF_W ,
halfword
.Pq Dv BPF_H ,
or byte
.Pq Dv BPF_B .
The semantics of all the recognized
.Dv BPF_LD
instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
.It Dv BPF_LDX
These instructions load a value into the index register.
Note that
the addressing modes are more restrictive than those of the accumulator loads,
but they include
.Dv BPF_MSH ,
a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
.It Dv BPF_ST
This instruction stores the accumulator into the scratch memory.
We do not need an addressing mode since there is only one possibility
for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
.It Dv BPF_STX
This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX                 M[k] <- X
.Ed
.It Dv BPF_ALU
The ALU instructions perform operations between the accumulator and
index register or constant, and store the result back in the accumulator.
For binary operations, a source mode is required
.Dv ( BPF_K
or
.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K   A <- A + k
BPF_ALU+BPF_SUB+BPF_K   A <- A - k
BPF_ALU+BPF_MUL+BPF_K   A <- A * k
BPF_ALU+BPF_DIV+BPF_K   A <- A / k
BPF_ALU+BPF_MOD+BPF_K   A <- A % k
BPF_ALU+BPF_AND+BPF_K   A <- A & k
BPF_ALU+BPF_OR+BPF_K    A <- A | k
BPF_ALU+BPF_XOR+BPF_K   A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K   A <- A << k
BPF_ALU+BPF_RSH+BPF_K   A <- A >> k
BPF_ALU+BPF_ADD+BPF_X   A <- A + X
BPF_ALU+BPF_SUB+BPF_X   A <- A - X
BPF_ALU+BPF_MUL+BPF_X   A <- A * X
BPF_ALU+BPF_DIV+BPF_X   A <- A / X
BPF_ALU+BPF_MOD+BPF_X   A <- A % X
BPF_ALU+BPF_AND+BPF_X   A <- A & X
BPF_ALU+BPF_OR+BPF_X    A <- A | X
BPF_ALU+BPF_XOR+BPF_X   A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X   A <- A << X
BPF_ALU+BPF_RSH+BPF_X   A <- A >> X
BPF_ALU+BPF_NEG         A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32 bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA          pc += k
BPF_JMP+BPF_JGT+BPF_K   pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K   pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K   pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K  pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X   pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X   pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X   pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X  pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A           accept A bytes
BPF_RET+BPF_K           accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX        X <- A
BPF_MISC+BPF_TXA        A <- X
.Ed
.El
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to a write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_setfilter .
This avoids performance degradation on high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
	    sizeof(struct ether_header)),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
This filter accepts only IP packets between host 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
	BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
	BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.