.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.Dd May 27, 2009
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
It is recommended that either
Protocol Independent Multicast - Sparse Mode (PIM-SM),
or Protocol Independent Multicast - Dense Mode (PIM-DM)
be used, as these are now the generally accepted protocols
in the Internet community.
The
.Sx HISTORY
section discusses previous multicast routing protocols.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
about the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the programming guide described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privileges
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for sending and receiving IGMP or MLD messages respectively.
In case of
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is opened, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., physical or a virtual tunnel)
that will be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = 0;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Dv VIFF_TUNNEL
flag is no longer supported by
.Fx .
Users who wish to forward multicast datagrams over a tunnel should consider
configuring a
.Xr gif 4
or
.Xr gre 4
tunnel and using it as a physical interface.
.Pp
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would have a value of 1.
.Pp
The
.Va max_rate_limit
argument is no longer supported in
.Fx
and should be set to 0.
Users who wish to rate-limit multicast datagrams should consider the use of
.Xr dummynet 4
or
.Xr altq 4 .
.Pp
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After the multicast forwarding is enabled, and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
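.Pp
For IPv4, the zeroed protocol byte described above is enough to tell a
kernel upcall apart from a regular IGMP packet read on the same socket.
The following is a minimal sketch rather than code from the kernel or from
any routing daemon.
The helper name is_mrt_upcall4 and the two constants are hypothetical;
the only assumption about layout is the standard IPv4 header, in which
ip_p is the byte at offset 9 and the shortest header is 20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of ip_p within the standard IPv4 header (assumption stated above). */
#define EX_IPV4_PROTO_OFFSET 9
/* Length in bytes of the shortest possible IPv4 header. */
#define EX_IPV4_MIN_HDRLEN 20

/*
 * Hypothetical helper: classify a buffer read from the IPv4 multicast
 * routing socket.
 * Returns 1 if the protocol byte is zero (kernel upcall),
 * 0 if it is non-zero (regular IP packet, e.g. IGMP),
 * and -1 if the buffer is too short to decide.
 */
static int
is_mrt_upcall4(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);
	if (len < EX_IPV4_MIN_HDRLEN)
		return (-1);
	return (buf[EX_IPV4_PROTO_OFFSET] == 0 ? 1 : 0);
}
```

A caller would typically pass the buffer filled in by recvfrom(2) and only
then interpret it as a struct igmpmsg when the helper returns 1.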
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype
with the type of the upcall
.Dv IGMPMSG_*
or
.Dv MRT6MSG_*
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi =
vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
If we want to add new features in the kernel, it becomes difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows us to preserve the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);        /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To obtain the same set of enabled features later:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
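.Pp
The outcome of the negotiation can be checked with plain bit arithmetic.
The sketch below is illustrative only and does not talk to the kernel.
The helper names mrt_api_usable and mrt_api_missing are hypothetical,
and the two flag values are copied from the feature list in this page;
they are guarded with #ifndef so that the real definitions from
netinet/ip_mroute.h take precedence when that header is included.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback copies of the values listed in this page (assumption). */
#ifndef MRT_MFC_RP
#define MRT_MFC_RP		(1 << 8)	/* enable RP address */
#endif
#ifndef MRT_MFC_BW_UPCALL
#define MRT_MFC_BW_UPCALL	(1 << 9)	/* enable bw upcalls */
#endif

/*
 * Hypothetical helper: features the process asked for and the kernel
 * reported as enabled. Only these may be relied upon afterwards.
 */
static uint32_t
mrt_api_usable(uint32_t wanted, uint32_t granted)
{
	return (wanted & granted);
}

/*
 * Hypothetical helper: features the process asked for but the kernel
 * did not enable. A non-zero result means the caller must fall back
 * to the basic API for those features.
 */
static uint32_t
mrt_api_missing(uint32_t wanted, uint32_t granted)
{
	return (wanted & ~granted);
}
```

Here wanted is the value passed to setsockopt(2) with MRT_API_CONFIG and
granted is the value found in the same variable after the call returns.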
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;       /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;     /* multicast group associated*/
        vifi_t          mfcc_parent;       /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
        struct in_addr  mfcc_rp;            /* the RP address           */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3.
.\" Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate the (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a data flow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we need to install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we need to check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application's point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 *   The first packet marks the start of a measurement interval.
 *   During an interval we count packets and bytes, and when we
 *   pass the threshold we deliver an upcall and we are done.
 *   The first packet after the end of the interval resets the
 *   count and restarts the measurement.
 *
 * For <= measurement:
 *   We start a timer to fire at the end of the interval, and
 *   then for each incoming packet we count packets and bytes.
 *   When the timer fires, we compare the value with the threshold,
 *   schedule an upcall if we are below, and restart the measurement
 *   (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                          128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC    3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC   0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is >= threshold");
  }
  if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is <= threshold");
  }
.Ed
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, then the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
    (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
with the fields of bw_upcall set
exactly the same as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with only the
.Dv BW_UPCALL_DELETE_ALL
flag inside field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr sourcefilter 3 ,
.Xr altq 4 ,
.Xr dummynet 4 ,
.Xr gif 4 ,
.Xr gre 4 ,
.Xr icmp6 4 ,
.Xr igmp 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr mld 4 ,
.Xr pim 4
.\"
.Sh HISTORY
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first developed multicast routing protocol.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF)
and Core Based Trees (CBT) were developed as well.
Routers at autonomous system boundaries may now exchange multicast
routes with peers via the Border Gateway Protocol (BGP).
Many other routing protocols are able to redistribute multicast routes
for use with
.Dv PIM-SM
and
.Dv PIM-DM .
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa https://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
The IGMPv3 and MLDv2 multicast support was implemented by
.An Bruce Simpson .
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).