.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD$
.\"
.Dd May 27, 2009
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
It is recommended that either
Protocol Independent Multicast - Sparse Mode (PIM-SM),
or Protocol Independent Multicast - Dense Mode (PIM-DM)
be used, as these are now the generally accepted protocols
in the Internet community.
The
.Sx HISTORY
section discusses previous multicast routing protocols.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
about the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the programming guide described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is then used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving of IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
sockets should be used
for sending and receiving respectively IGMP or MLD messages.
In case of
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving of IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., physical or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = 0;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Dv VIFF_TUNNEL
flag is no longer supported by
.Fx .
Users who wish to forward multicast datagrams over a tunnel should consider
configuring a
.Xr gif 4
or
.Xr gre 4
tunnel and using it as a physical interface.
.Pp
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it has a value of 1.
.Pp
The
.Va max_rate_limit
argument is no longer supported in
.Fx
and should be set to 0.
Users who wish to rate-limit multicast datagrams should consider the use of
.Xr dummynet 4
or
.Xr altq 4 .
.Pp
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After the multicast forwarding is enabled, and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype
with the type of the upcall
.Dv IGMPMSG_*
and
.Dv MRT6MSG_*
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = vif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
If we want to add new features in the kernel, it becomes difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows us to preserve the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and to set a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);	/* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To later obtain the set of features that were enabled:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
    /* the mfcctl fields */
    struct in_addr  mfcc_origin;        /* ip origin of mcasts */
    struct in_addr  mfcc_mcastgrp;      /* multicast group associated*/
    vifi_t          mfcc_parent;        /* incoming vif */
    u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs */

    /* extension fields */
    uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
    struct in_addr  mfcc_rp;            /* the RP address */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (in case when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we need to install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we need to check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application's point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
    struct timeval  b_time;
    uint64_t        b_packets;
    uint64_t        b_bytes;
};

struct bw_upcall {
    struct in_addr  bu_src;         /* source address */
    struct in_addr  bu_dst;         /* destination address */
    uint32_t        bu_flags;       /* misc flags (see below) */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets) */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes) */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
    struct bw_data  bu_threshold;   /* the bw threshold */
    struct bw_data  bu_measured;    /* the measured bw */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX 128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC 3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC 0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is >= threshold");
  }
  if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is <= threshold");
  }
.Ed
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values are allowed, then the bandwidth
estimation may be less accurate, or the potentially very high frequency
of the generated upcalls may introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
    (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
and the fields of bw_upcall must be set
exactly the same as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, and then just set only the
.Dv BW_UPCALL_DELETE_ALL
flag inside field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL 4 /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled-in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr altq 4 ,
.Xr dummynet 4 ,
.Xr getsockopt 2 ,
.Xr gif 4 ,
.Xr gre 4 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr sourcefilter 3 ,
.Xr icmp6 4 ,
.Xr igmp 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr mld 4 ,
.Xr pim 4
.\"
.Sh HISTORY
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF)
and Core Based Trees (CBT) were developed as well.
Routers at autonomous system boundaries may now exchange multicast
routes with peers via the Border Gateway Protocol (BGP).
Many other routing protocols are able to redistribute multicast routes
for use with
.Dv PIM-SM
and
.Dv PIM-DM .
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
The IGMPv3 and MLDv2 multicast support was implemented by
.An Bruce Simpson .
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).