.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD$
.\"
.Dd September 4, 2003
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast - Sparse Mode (PIM-SM),
and Protocol Independent Multicast - Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
about the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the programming guide described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for sending and receiving the IGMP or MLD messages respectively.
In the case of a
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
           sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., physical or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
           sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
           sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would have a value of 1.
The
.Va max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
           sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
           sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
           sizeof(mifi));
.Ed
.Pp
After the multicast forwarding is enabled, and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype ,
which holds the upcall type,
.Dv IGMPMSG_*
or
.Dv MRT6MSG_* ,
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
If we want to add new features in the kernel, it becomes difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows us to preserve the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);        /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To later retrieve the set of features that were enabled:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
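.Pp
The negotiation described above comes down to two bitmask checks,
which can be sketched in portable C without touching a socket.
This is only an illustration under stated assumptions: the
.Dv EX_*
constants are local copies of feature bits listed in this page
(real programs get the
.Dv MRT_*
definitions from
.In netinet/ip_mroute.h ) ,
and
.Fn ex_kernel_grant ,
.Fn ex_all_granted
and
.Fn ex_negotiation_demo
are hypothetical helpers that are not part of any kernel or library API.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Local copies of feature bits listed in this page (assumption). */
#define EX_MRT_MFC_FLAGS_DISABLE_WRONGVIF (1u << 0)
#define EX_MRT_MFC_RP                     (1u << 8)
#define EX_MRT_MFC_BW_UPCALL              (1u << 9)

/*
 * Hypothetical stand-in for the kernel side of MRT_API_CONFIG:
 * only the requested bits that are also supported are granted.
 */
uint32_t
ex_kernel_grant(uint32_t wanted, uint32_t supported)
{
	return (wanted & supported);
}

/* Returns 1 if every bit in "needed" is present in "granted". */
int
ex_all_granted(uint32_t granted, uint32_t needed)
{
	return ((granted & needed) == needed);
}

/* Usage sketch: the caller wants two features, one is supported. */
void
ex_negotiation_demo(void)
{
	uint32_t wanted = EX_MRT_MFC_RP | EX_MRT_MFC_BW_UPCALL;
	uint32_t supported = EX_MRT_MFC_FLAGS_DISABLE_WRONGVIF |
	    EX_MRT_MFC_RP;
	uint32_t granted = ex_kernel_grant(wanted, supported);

	/* Only the RP feature may be used afterwards. */
	assert(ex_all_granted(granted, EX_MRT_MFC_RP));
	assert(!ex_all_granted(granted, EX_MRT_MFC_BW_UPCALL));
	assert(!ex_all_granted(granted, wanted));
}
```
.Ed
.Pp
In a routing daemon, the value written back into
.Va v
by
.Fn setsockopt MRT_API_CONFIG
would take the place of the
.Fn ex_kernel_grant
result, and each optional code path would be guarded by a check like
.Fn ex_all_granted .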
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;        /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;      /* multicast group associated*/
        vifi_t          mfcc_parent;        /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS]; /* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags */
        struct in_addr  mfcc_rp;            /* the RP address            */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (in case when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3.
.\" Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
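.Pp
The two conditions that select between kernel-level Register encapsulation
and the
.Dv IGMPMSG_WHOLEPKT
upcall can be summarized in a small decision function.
The sketch below is an assumption-laden illustration, not kernel code:
.Dv EX_MRT_MFC_RP
is a local copy of the feature bit listed in this page, the enum and
.Fn ex_choose_register_path
are hypothetical names, and only the two conditions named in the text
above are modelled.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>		/* INADDR_ANY */

/* Local copy of the feature bit listed in this page (assumption). */
#define EX_MRT_MFC_RP	(1u << 8)

enum ex_register_path {
	EX_KERNEL_ENCAP,	/* kernel builds the PIM Register itself */
	EX_WHOLEPKT_UPCALL	/* packet is sent up in IGMPMSG_WHOLEPKT */
};

/*
 * api_config: feature set granted through MRT_API_CONFIG.
 * rp_addr:    value of mfcc_rp.s_addr for the MFC entry.
 */
enum ex_register_path
ex_choose_register_path(uint32_t api_config, uint32_t rp_addr)
{
	if ((api_config & EX_MRT_MFC_RP) != 0 && rp_addr != INADDR_ANY)
		return (EX_KERNEL_ENCAP);
	return (EX_WHOLEPKT_UPCALL);
}

/* Usage sketch covering the three cases described in the text. */
void
ex_register_demo(void)
{
	uint32_t rp = 0x0100000a;	/* any non-zero RP address */

	assert(ex_choose_register_path(EX_MRT_MFC_RP, rp) ==
	    EX_KERNEL_ENCAP);
	assert(ex_choose_register_path(EX_MRT_MFC_RP, INADDR_ANY) ==
	    EX_WHOLEPKT_UPCALL);
	assert(ex_choose_register_path(0, rp) == EX_WHOLEPKT_UPCALL);
}
```
.Ed
.Pp
A user-level process can apply the same test to decide whether it still
has to be prepared to encapsulate Register messages itself for a given
MFC entry.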
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate an (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we need to install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we need to check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application's point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                          128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC    3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC   0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is >= threshold");
  }
  if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is <= threshold");
  }
.Ed
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, then the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
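.Pp
The pseudo-algorithm above can be turned into a small, self-contained C
function, which may help when unit-testing user-level code that reasons
about filters.
This is a sketch under stated assumptions, not the kernel implementation: the
.Dv EX_BW_UPCALL_*
constants are local copies of the flag values listed in this page,
.Vt "struct ex_bw"
is a simplified stand-in for
.Vt "struct bw_data"
(the interval is kept in microseconds rather than in a
.Vt "struct timeval" ) ,
and returning \-1 for an invalid GEQ/LEQ combination is a choice made
for this example only.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the flag values listed in this page (assumption). */
#define EX_BW_UPCALL_UNIT_PACKETS	(1u << 0)
#define EX_BW_UPCALL_UNIT_BYTES		(1u << 1)
#define EX_BW_UPCALL_GEQ		(1u << 2)
#define EX_BW_UPCALL_LEQ		(1u << 3)

/* Simplified stand-in for struct bw_data. */
struct ex_bw {
	uint64_t usec;		/* interval length, in microseconds */
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Returns 1 if an upcall should be sent, 0 if not, and -1 if the
 * flags do not select exactly one of GEQ and LEQ.
 * "m" holds the measured values, "t" the threshold.
 */
int
ex_should_upcall(uint32_t flags, const struct ex_bw *m,
    const struct ex_bw *t)
{
	int geq = (flags & EX_BW_UPCALL_GEQ) != 0;
	int leq = (flags & EX_BW_UPCALL_LEQ) != 0;
	int by_packets = (flags & EX_BW_UPCALL_UNIT_PACKETS) != 0;
	int by_bytes = (flags & EX_BW_UPCALL_UNIT_BYTES) != 0;

	if (geq == leq)
		return (-1);	/* GEQ and LEQ are mutually exclusive */
	if (geq)		/* may fire before the interval ends */
		return ((by_packets && m->packets >= t->packets) ||
		    (by_bytes && m->bytes >= t->bytes));
	if (m->usec < t->usec)	/* LEQ: wait for the full interval */
		return (0);
	return ((by_packets && m->packets <= t->packets) ||
	    (by_bytes && m->bytes <= t->bytes));
}

/* Usage sketch: a 3 second threshold of 5 packets or 1000 bytes. */
void
ex_bw_demo(void)
{
	struct ex_bw thr = { 3000000, 5, 1000 };
	struct ex_bw early = { 1000000, 10, 100 };
	struct ex_bw done = { 3000000, 10, 100 };

	assert(ex_should_upcall(EX_BW_UPCALL_GEQ |
	    EX_BW_UPCALL_UNIT_PACKETS, &early, &thr) == 1);
	assert(ex_should_upcall(EX_BW_UPCALL_LEQ |
	    EX_BW_UPCALL_UNIT_BYTES, &early, &thr) == 0);
	assert(ex_should_upcall(EX_BW_UPCALL_LEQ |
	    EX_BW_UPCALL_UNIT_BYTES, &done, &thr) == 1);
	assert(ex_should_upcall(EX_BW_UPCALL_GEQ | EX_BW_UPCALL_LEQ |
	    EX_BW_UPCALL_UNIT_PACKETS, &done, &thr) == -1);
}
```
.Ed
.Pp
The second assertion in
.Fn ex_bw_demo
shows the asymmetry described above: a <= filter cannot fire after one
second of a three second interval, even though the byte count is already
below the threshold.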
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
          (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
and the fields of bw_upcall must be set
exactly the same as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with only the
.Dv BW_UPCALL_DELETE_ALL
flag inside field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).