.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.Dd February 13, 2026
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
It is recommended that either
Protocol Independent Multicast - Sparse Mode (PIM-SM),
or Protocol Independent Multicast - Dense Mode (PIM-DM)
is used, as these are now the generally accepted protocols
in the Internet community.
The
.Sx HISTORY
section discusses previous multicast routing protocols.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
about the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the programming guide described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is then used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving of IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
sockets should be used
for sending and receiving respectively IGMP or MLD messages.
In case of a
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket must be used for sending and receiving of IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is opened, it can be used to enable
multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
When applied to the multicast routing socket, the
.Dv MRT_DONE
and
.Dv MRT6_DONE
socket options disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DONE, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DONE, (void *)&v, sizeof(v));
.Ed
.Pp
Closing the socket has the same effect.
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = 0;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Dv VIFF_TUNNEL
flag is no longer supported by
.Fx .
Users who wish to forward multicast datagrams over a tunnel should consider
configuring a
.Xr gif 4
or
.Xr gre 4
tunnel and using it as a physical interface.
.Pp
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it has a value of 1.
.Pp
The
.Va max_rate_limit
argument (stored in
.Va vifc_rate_limit )
is no longer supported in
.Fx
and should be set to 0.
Users who wish to rate-limit multicast datagrams should consider the use of
.Xr dummynet 4
or
.Xr altq 4 .
.Pp
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per multicast interface.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After the multicast forwarding is enabled, and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype
with the type of the upcall,
.Dv IGMPMSG_*
or
.Dv MRT6MSG_* ,
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
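.Pp
As a rough illustration of the header rule above, the following sketch
classifies a buffer read from the IPv4 multicast routing socket.
It relies only on the statement that the upcall header overlays
.Vt "struct ip"
with the protocol byte set to zero.
The byte offset used for
.Va im_msgtype
(8), the function
.Fn classify_mrt4_message ,
the enumeration and the
.Dv EX_*
constants are assumptions made for this example and should be checked against
.In netinet/ip_mroute.h .
.Bd -literal
```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative value only; verify against <netinet/ip_mroute.h>. */
#define EX_IGMPMSG_NOCACHE 1

/* Offsets inside the 20-byte IPv4 header that the upcall header overlays. */
#define EX_OFF_MSGTYPE 8   /* assumed position of im_msgtype */
#define EX_OFF_MBZ     9   /* im_mbz occupies the position of ip_p */
#define EX_HDR_LEN     20

enum mrt4_kind {
    MRT4_TOO_SHORT,     /* not even a full header */
    MRT4_IP_PACKET,     /* a real IP packet, for example IGMP */
    MRT4_UPCALL         /* a kernel upcall; *msgtype is valid */
};

/* Decide whether buf holds a kernel upcall or an ordinary packet. */
enum mrt4_kind
classify_mrt4_message(const uint8_t *buf, size_t len, int *msgtype)
{
    if (len < EX_HDR_LEN)
        return (MRT4_TOO_SHORT);
    if (buf[EX_OFF_MBZ] != 0)
        return (MRT4_IP_PACKET);
    *msgtype = buf[EX_OFF_MSGTYPE];
    return (MRT4_UPCALL);
}
```
.Ed
.Pp
A routing daemon would typically call such a helper on every message
returned by
.Xr recvfrom 2
before deciding whether to parse it as IGMP or as an upcall.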
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
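.Pp
The TTL rule above can be sketched without any kernel headers.
The helper below derives the set of outgoing interfaces from an
.Va oifs_ttl[]
array in the same spirit as the IPv6 loop that calls
.Fn IF_SET :
an interface belongs to the set only if its TTL threshold is non-zero.
The function name
.Fn oifs_from_ttls
and the 32-bit mask representation are assumptions made for this example;
they are not part of the kernel API.
.Bd -literal
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Build a bit mask of outgoing interfaces from per-interface TTL
 * thresholds.  Bit i is set when oifs_ttl[i] is non-zero.  Only the
 * first 32 interfaces can be represented in this sketch.
 */
uint32_t
oifs_from_ttls(const unsigned char *oifs_ttl, size_t nvifs)
{
    uint32_t set = 0;
    size_t i;

    for (i = 0; i < nvifs && i < 32; i++)
        if (oifs_ttl[i] > 0)
            set |= (uint32_t)1 << i;
    return (set);
}
```
.Ed
.Pp
For the thresholds {0, 1, 0, 16} the helper selects interfaces 1 and 3.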
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
If we want to add new features in the kernel, it becomes difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows us to preserve the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., a maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and to enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);    /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To later obtain the set of features that were enabled:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
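.Pp
The negotiation contract can be modelled in a few lines.
The sketch below is not the kernel implementation; it only illustrates the
behavior described above, under the assumption that the kernel grants the
intersection of the requested and the supported features, after which the
caller checks that every feature it depends on was granted.
The function names and the
.Dv EX_*
constants are hypothetical; the bit positions mirror the feature flags
defined in
.In netinet/ip_mroute.h .
.Bd -literal
```c
#include <stdint.h>

#define EX_FLAG_DISABLE_WRONGVIF (1u << 0)
#define EX_FLAG_BORDER_VIF       (1u << 1)
#define EX_CAP_RP                (1u << 8)
#define EX_CAP_BW_UPCALL         (1u << 9)

/* Model of the kernel side: grant what is both requested and supported. */
uint32_t
model_api_config(uint32_t requested, uint32_t supported)
{
    return (requested & supported);
}

/* Caller side: nonzero if every required feature was granted. */
int
features_granted(uint32_t granted, uint32_t required)
{
    return ((granted & required) == required);
}
```
.Ed
.Pp
A daemon that cannot work without a feature would treat a zero result from
.Fn features_granted
as a reason to fall back to the basic API or to exit.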
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
    /* the mfcctl fields */
    struct in_addr  mfcc_origin;        /* ip origin of mcasts */
    struct in_addr  mfcc_mcastgrp;      /* multicast group associated */
    vifi_t          mfcc_parent;        /* incoming vif */
    u_char          mfcc_ttls[MAXVIFS]; /* forwarding ttls on vifs */

    /* extension fields */
    uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags */
    struct in_addr  mfcc_rp;            /* the RP address */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for an interface,
then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (in the case when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future use.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3.
.\" Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
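.Pp
The Register encapsulation decision described above can be summarized as a
small predicate.
This is a sketch of the documented behavior rather than kernel code:
.Fn choose_register_action ,
the enumeration and the
.Dv EX_*
macros are names invented for the example, and addresses are passed as plain
32-bit values for simplicity.
.Bd -literal
```c
#include <stdint.h>

#define EX_MRT_MFC_RP (1u << 8) /* bit position listed for MRT_MFC_RP */
#define EX_INADDR_ANY 0u

enum register_action {
    REGISTER_UPCALL_WHOLEPKT,   /* deliver IGMPMSG_WHOLEPKT to user level */
    REGISTER_KERNEL_ENCAP       /* kernel performs the encapsulation */
};

/* api_config is the negotiated feature set; mfcc_rp is the RP address. */
enum register_action
choose_register_action(uint32_t api_config, uint32_t mfcc_rp)
{
    if ((api_config & EX_MRT_MFC_RP) != 0 && mfcc_rp != EX_INADDR_ANY)
        return (REGISTER_KERNEL_ENCAP);
    return (REGISTER_UPCALL_WHOLEPKT);
}
```
.Ed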
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. A flow is
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
    struct timeval  b_time;
    uint64_t        b_packets;
    uint64_t        b_bytes;
};

struct bw_upcall {
    struct in_addr  bu_src;         /* source address */
    struct in_addr  bu_dst;         /* destination address */
    uint32_t        bu_flags;       /* misc flags (see below) */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets) */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes) */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d */
    struct bw_data  bu_threshold;   /* the bw threshold */
    struct bw_data  bu_measured;    /* the measured bw */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX 128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC 3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC 0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is >= threshold");
}
if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is <= threshold");
}
.Ed
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, then the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
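.Pp
The filter pseudo-algorithm above can also be written as ordinary C.
The sketch below is one interpretation of it, not the kernel code:
.Vt "struct ex_counts" ,
the function
.Fn bw_filter_fires
and the
.Dv EX_BW_*
names are assumptions made for this example (the flag values mirror the
.Dv BW_UPCALL_*
definitions), and the
.Va interval_elapsed
argument stands for the condition that the measured interval has reached
the threshold interval.
.Bd -literal
```c
#include <stdint.h>

#define EX_BW_UNIT_PACKETS (1u << 0)
#define EX_BW_UNIT_BYTES   (1u << 1)
#define EX_BW_GEQ          (1u << 2)
#define EX_BW_LEQ          (1u << 3)

struct ex_counts {
    uint64_t packets;
    uint64_t bytes;
};

/* Returns nonzero if a filter with these flags would trigger an upcall. */
int
bw_filter_fires(uint32_t flags, const struct ex_counts *measured,
    const struct ex_counts *threshold, int interval_elapsed)
{
    int geq = (flags & EX_BW_GEQ) != 0;
    int leq = (flags & EX_BW_LEQ) != 0;

    if (geq == leq)
        return (0);     /* GEQ and LEQ are mutually exclusive */
    if (leq && !interval_elapsed)
        return (0);     /* <= is only decided when the interval ends */
    if (flags & EX_BW_UNIT_PACKETS) {
        if (geq ? measured->packets >= threshold->packets :
            measured->packets <= threshold->packets)
            return (1);
    }
    if (flags & EX_BW_UNIT_BYTES) {
        if (geq ? measured->bytes >= threshold->bytes :
            measured->bytes <= threshold->bytes)
            return (1);
    }
    return (0);
}
```
.Ed
.Pp
Note how a >= filter may fire before the interval has elapsed, whereas a
<= filter never does.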
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
    (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
and the fields of bw_upcall must be set
exactly the same as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with only the
.Dv BW_UPCALL_DELETE_ALL
flag inside field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL 4 /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled-in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr sourcefilter 3 ,
.Xr altq 4 ,
.Xr dummynet 4 ,
.Xr gif 4 ,
.Xr gre 4 ,
.Xr icmp6 4 ,
.Xr igmp 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr mld 4 ,
.Xr pim 4
.\"
.Sh HISTORY
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first developed multicast routing protocol.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF)
and Core Based Trees (CBT) were developed as well.
Routers at autonomous system boundaries may now exchange multicast
routes with peers via the Border Gateway Protocol (BGP).
Many other routing protocols are able to redistribute multicast routes
for use with
.Dv PIM-SM
and
.Dv PIM-DM .
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa https://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
The IGMPv3 and MLDv2 multicast support was implemented by
.An Bruce Simpson .
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).