.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.Dd May 27, 2009
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
It is recommended that either
Protocol Independent Multicast - Sparse Mode (PIM-SM)
or Protocol Independent Multicast - Dense Mode (PIM-DM)
be used, as these are now the generally accepted protocols
in the Internet community.
The
.Sx HISTORY
section discusses previous multicast routing protocols.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
for the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the API described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require appropriate privilege
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for sending and receiving IGMP or MLD messages respectively.
In a
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
           sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = 0;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
           sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Dv VIFF_TUNNEL
flag is no longer supported by
.Fx .
Users who wish to forward multicast datagrams over a tunnel should consider
configuring a
.Xr gif 4
or
.Xr gre 4
tunnel and using it as a physical interface.
.Pp
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it has a value of 1.
.Pp
The
.Va vifc_rate_limit
field is no longer supported in
.Fx
and should be set to 0.
Users who wish to rate-limit multicast datagrams should consider the use of
.Xr dummynet 4
or
.Xr altq 4 .
.Pp
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
           sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per multicast interface.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
           sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
           sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was initialized
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype
with the type of the upcall,
.Dv IGMPMSG_*
or
.Dv MRT6MSG_* ,
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall is a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
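.Pp
The IPv4 check described above can be sketched in portable C.
This is a minimal illustration under stated assumptions, not code taken from the kernel:
the function
.Fn classify_packet
and the
.Dv SK_*
constants are hypothetical, the byte offset of
.Va im_msgtype
and the numeric value of
.Dv IGMPMSG_NOCACHE
are assumed, and a real program should rely on the definitions in
.In netinet/ip_mroute.h
instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_IPV4_HDR_MIN   20 /* minimum IPv4 header length in bytes */
#define SK_MBZ_OFFSET      9 /* im_mbz overlays ip_p, byte 9 of the header */
#define SK_MSGTYPE_OFFSET  8 /* ASSUMED offset of im_msgtype */
#define SK_MSG_NOCACHE     1 /* ASSUMED value of IGMPMSG_NOCACHE */

enum pkt_kind {
    PKT_TOO_SHORT,      /* not even a full IPv4 header */
    PKT_PROTOCOL,       /* ordinary packet (e.g., IGMP): ip_p is non-zero */
    PKT_UPCALL_NOCACHE, /* kernel upcall asking for an MFC entry */
    PKT_UPCALL_OTHER    /* some other kernel upcall */
};

/* Classify one buffer read from the multicast routing socket. */
enum pkt_kind
classify_packet(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < SK_IPV4_HDR_MIN)
        return PKT_TOO_SHORT;
    if (buf[SK_MBZ_OFFSET] != 0)
        return PKT_PROTOCOL;
    if (buf[SK_MSGTYPE_OFFSET] == SK_MSG_NOCACHE)
        return PKT_UPCALL_NOCACHE;
    return PKT_UPCALL_OTHER;
}
```

A receive loop could call this helper on every buffer read from the multicast routing socket and install an MFC entry only when it reports a NOCACHE upcall.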
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
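.Pp
As a sketch of the last two rules, the helper below reduces a per-interface TTL array to a set of outgoing interfaces.
It is an illustration only:
.Fn oifs_from_ttls
is a hypothetical name, and the 32-bit mask is a simplified stand-in for the interface set that real IPv6 code fills with
.Fn IF_SET .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return a bitmask in which bit i is set when interface i belongs to
 * the outgoing interface set, i.e., when its TTL threshold is non-zero.
 * Only the first 32 interfaces are considered in this sketch.
 */
uint32_t
oifs_from_ttls(const uint8_t *oifs_ttl, size_t n)
{
    uint32_t set = 0;
    size_t i;

    assert(oifs_ttl != NULL || n == 0);
    for (i = 0; i < n && i < 32; i++) {
        if (oifs_ttl[i] > 0)
            set |= (uint32_t)1 << i;
    }
    return set;
}
```

The TTL values themselves are dropped by the conversion, which mirrors the statement that for IPv6 only membership in the outgoing set can be expressed.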
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
If we want to add new features in the kernel, it becomes difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows us to preserve backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
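.Pp
The three steps above can be modelled with plain bit operations.
The sketch below is hypothetical:
.Fn kernel_grant
stands in for the kernel side of the negotiation and assumes that the kernel grants every requested feature it supports, which is the simplest policy consistent with the description; a real kernel may be more restrictive.

```c
#include <assert.h>
#include <stdint.h>

/* Step 2: the kernel answers with the subset it is willing to enable. */
uint32_t
kernel_grant(uint32_t requested, uint32_t supported)
{
    uint32_t granted = requested & supported;

    /* The kernel never enables a feature that was not requested. */
    assert((granted & ~requested) == 0);
    return granted;
}

/* Step 3: features the process asked for but must not use. */
uint32_t
missing_features(uint32_t requested, uint32_t granted)
{
    return requested & ~granted;
}
```

A user-level process would inspect the result of
.Fn missing_features
and fall back to the basic API behaviour for every bit that is set in it.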
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
This call stores in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);	/* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To later obtain the set of features that were enabled:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
Therefore,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif               */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address        */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls        */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC     (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;       /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;     /* multicast group associated*/
        vifi_t          mfcc_parent;       /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
        struct in_addr  mfcc_rp;            /* the RP address           */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
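.Pp
The effect of appending the new fields can be checked with
.Fn offsetof .
The sketch below uses hypothetical stand-in structures
.Pq Vt "struct sk_mfcctl" and Vt "struct sk_mfcctl2"
with fixed-width integers in place of
.Vt "struct in_addr"
and
.Vt vifi_t ,
and an assumed limit of 32 interfaces; real code should use the definitions from
.In netinet/ip_mroute.h .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_MAXVIFS 32 /* ASSUMED; the real limit is MAXVIFS */

struct sk_mfcctl {              /* stand-in for struct mfcctl */
    uint32_t origin;
    uint32_t mcastgrp;
    uint16_t parent;
    uint8_t  ttls[SK_MAXVIFS];
};

struct sk_mfcctl2 {             /* stand-in for struct mfcctl2 */
    uint32_t origin;
    uint32_t mcastgrp;
    uint16_t parent;
    uint8_t  ttls[SK_MAXVIFS];
    uint8_t  flags[SK_MAXVIFS]; /* extension field */
    uint32_t rp;                /* extension field */
};

/* Return 1 when every legacy field has the same offset in both layouts. */
int
legacy_prefix_matches(void)
{
    return offsetof(struct sk_mfcctl, origin) ==
            offsetof(struct sk_mfcctl2, origin) &&
        offsetof(struct sk_mfcctl, mcastgrp) ==
            offsetof(struct sk_mfcctl2, mcastgrp) &&
        offsetof(struct sk_mfcctl, parent) ==
            offsetof(struct sk_mfcctl2, parent) &&
        offsetof(struct sk_mfcctl, ttls) ==
            offsetof(struct sk_mfcctl2, ttls);
}

/* Build an extended request from a legacy one; extensions start as zero. */
void
sk_upgrade(const struct sk_mfcctl *old_req, struct sk_mfcctl2 *new_req)
{
    assert(old_req != NULL && new_req != NULL);
    memset(new_req, 0, sizeof(*new_req));
    new_req->origin = old_req->origin;
    new_req->mcastgrp = old_req->mcastgrp;
    new_req->parent = old_req->parent;
    memcpy(new_req->ttls, old_req->ttls, sizeof(new_req->ttls));
}
```

Because the legacy fields keep their offsets, code that only understands the old layout can still read the leading part of an extended request.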
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif               */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
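.Pp
The per-interface rule can be written as a small predicate.
This is a simplified sketch, not the kernel's logic:
.Fn wrongvif_signal_wanted
and the
.Dv SK_*
names are hypothetical, and any additional conditions the kernel applies before sending the signal are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAXVIFS               32        /* ASSUMED interface limit */
#define SK_FLAG_DISABLE_WRONGVIF (1u << 0) /* same bit as the flag above */

/*
 * Decide whether a packet that arrived on arrival_vif should raise a
 * WRONGVIF signal for an MFC entry whose incoming interface is parent_vif.
 */
int
wrongvif_signal_wanted(const uint8_t flags[SK_MAXVIFS],
    unsigned arrival_vif, unsigned parent_vif)
{
    assert(flags != NULL);
    if (arrival_vif >= SK_MAXVIFS)
        return 0;               /* not a valid interface index */
    if (arrival_vif == parent_vif)
        return 0;               /* packet arrived on the expected vif */
    return (flags[arrival_vif] & SK_FLAG_DISABLE_WRONGVIF) == 0;
}
```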
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (in case when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future use.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address is taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
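.Pp
The two conditions above combine into a simple decision.
The following sketch is illustrative only:
.Fn register_path ,
.Vt "struct sk_mfc"
and the
.Dv SK_*
names are hypothetical, and the RP address is represented as a plain 32-bit value rather than a
.Vt "struct in_addr" .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MRT_MFC_RP (1u << 8) /* same bit as MRT_MFC_RP above */
#define SK_INADDR_ANY 0u        /* numeric value of INADDR_ANY */

/* Hypothetical, reduced view of one MFC entry. */
struct sk_mfc {
    uint32_t rp;                /* mfcc_rp as a 32-bit value */
};

enum sk_register_path {
    SK_REGISTER_USER_LEVEL,     /* IGMPMSG_WHOLEPKT upcall is delivered */
    SK_REGISTER_IN_KERNEL       /* kernel encapsulates the packet itself */
};

/* api_config is the feature set returned by the MRT_API_CONFIG negotiation. */
enum sk_register_path
register_path(uint32_t api_config, const struct sk_mfc *mfc)
{
    assert(mfc != NULL);
    if ((api_config & SK_MRT_MFC_RP) == 0)
        return SK_REGISTER_USER_LEVEL;  /* feature was not enabled */
    if (mfc->rp == SK_INADDR_ANY)
        return SK_REGISTER_USER_LEVEL;  /* no RP set for this entry */
    return SK_REGISTER_IN_KERNEL;
}
```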
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate an (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
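.Pp
For comparison, the polling approach amounts to the arithmetic sketched below.
The names
.Vt "struct sk_sample" ,
.Fn sk_bytes_per_second
and
.Fn sk_flow_is_idle
are hypothetical; each sample is assumed to hold a byte counter obtained from a per-(S,G) statistics query together with the time, in milliseconds, at which it was read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One reading of the per-(S,G) byte counter. */
struct sk_sample {
    uint64_t bytes;   /* cumulative forwarded bytes */
    uint64_t t_ms;    /* time of the reading, in milliseconds */
};

/*
 * Estimate the forwarding rate between two readings.
 * Returns -1.0 when the readings cannot be compared (no time has
 * passed, or the counter went backwards).
 */
double
sk_bytes_per_second(const struct sk_sample *prev, const struct sk_sample *cur)
{
    assert(prev != NULL && cur != NULL);
    if (cur->t_ms <= prev->t_ms || cur->bytes < prev->bytes)
        return -1.0;
    return (double)(cur->bytes - prev->bytes) * 1000.0 /
        (double)(cur->t_ms - prev->t_ms);
}

/* A flow is considered idle when nothing was forwarded between readings. */
int
sk_flow_is_idle(const struct sk_sample *prev, const struct sk_sample *cur)
{
    assert(prev != NULL && cur != NULL);
    return cur->bytes == prev->bytes;
}
```

Every monitored (S,G) pair needs its own periodic query in this scheme, which is the scaling problem the bandwidth upcalls are meant to avoid.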
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added and deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument).
.El
.Pp
From an application's point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX				128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC	3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC	0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is >= threshold");
}
if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is <= threshold");
}
.Ed
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
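.Pp
The pseudo-algorithm can be turned into compilable C as follows.
This is a sketch of the decision only, not the kernel implementation:
.Fn upcall_due
and
.Vt "struct sk_counts"
are hypothetical names, the interval is expressed in whole milliseconds instead of a
.Vt "struct timeval" ,
and the flag values are copied from the definitions shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#define BW_UPCALL_UNIT_PACKETS (1 << 0)
#define BW_UPCALL_UNIT_BYTES   (1 << 1)
#define BW_UPCALL_GEQ          (1 << 2)
#define BW_UPCALL_LEQ          (1 << 3)

/* Counters for one measurement, or the thresholds to compare against. */
struct sk_counts {
    uint64_t packets;
    uint64_t bytes;
    uint64_t interval_ms;
};

/* Return 1 when the filter described by flags should trigger an upcall. */
int
upcall_due(uint32_t flags, struct sk_counts measured,
    struct sk_counts threshold)
{
    int by_packets = (flags & BW_UPCALL_UNIT_PACKETS) != 0;
    int by_bytes = (flags & BW_UPCALL_UNIT_BYTES) != 0;

    /* GEQ and LEQ are mutually exclusive in a valid filter. */
    assert((flags & (BW_UPCALL_GEQ | BW_UPCALL_LEQ)) !=
        (BW_UPCALL_GEQ | BW_UPCALL_LEQ));

    if (flags & BW_UPCALL_GEQ) {
        /* May fire before the interval ends. */
        return (by_packets && measured.packets >= threshold.packets) ||
            (by_bytes && measured.bytes >= threshold.bytes);
    }
    if ((flags & BW_UPCALL_LEQ) &&
        measured.interval_ms >= threshold.interval_ms) {
        /* Only meaningful once the whole interval has elapsed. */
        return (by_packets && measured.packets <= threshold.packets) ||
            (by_bytes && measured.bytes <= threshold.bytes);
    }
    return 0;
}
```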
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
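.Pp
Before installing a filter, a program may want to check these constraints itself.
The sketch below is an assumption about what a sensible pre-check looks like, not a description of the kernel's validation:
.Fn sk_filter_is_valid
and
.Vt "struct sk_filter"
are hypothetical, and the requirement that at least one unit flag be set is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BW_UPCALL_UNIT_PACKETS (1 << 0)
#define BW_UPCALL_UNIT_BYTES   (1 << 1)
#define BW_UPCALL_GEQ          (1 << 2)
#define BW_UPCALL_LEQ          (1 << 3)
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC 3

/* Hypothetical, reduced view of the fields being checked. */
struct sk_filter {
    uint32_t flags;   /* bu_flags */
    long     sec;     /* bu_threshold.b_time.tv_sec */
    long     usec;    /* bu_threshold.b_time.tv_usec */
};

/* Return 1 when the filter satisfies the documented constraints. */
int
sk_filter_is_valid(const struct sk_filter *f)
{
    int geq, leq;

    assert(f != NULL);
    geq = (f->flags & BW_UPCALL_GEQ) != 0;
    leq = (f->flags & BW_UPCALL_LEQ) != 0;
    if (geq == leq)
        return 0;   /* exactly one comparison must be chosen */
    if ((f->flags & (BW_UPCALL_UNIT_PACKETS | BW_UPCALL_UNIT_BYTES)) == 0)
        return 0;   /* ASSUMED: at least one unit is required */
    if (f->usec < 0 || f->usec >= 1000000)
        return 0;   /* not a normalized timeval */
    if (f->sec < BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC)
        return 0;   /* interval shorter than the 3 second minimum */
    return 1;
}
```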
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
          (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL
with the fields of
.Vt bw_upcall
set exactly the same as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
set only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall" ,
and set only the
.Dv BW_UPCALL_DELETE_ALL
flag in the field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In each
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
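.Pp
The pacing rule in the last paragraph can be modelled with a small amount of state.
This is a user-level illustration of the rule, not the kernel's timer code:
.Vt "struct sk_pacer"
and
.Fn sk_pacer_try
are hypothetical, times are in milliseconds, and a monotonic clock is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pacing state for one installed filter. */
struct sk_pacer {
    int      have_last;    /* 0 until the first upcall has been sent */
    uint64_t last_ms;      /* time of the previous upcall */
    uint64_t interval_ms;  /* bu_threshold.b_time, in milliseconds */
};

/*
 * Return 1, and remember the time, if an upcall may be delivered now;
 * return 0 if the previous upcall was less than one interval ago.
 */
int
sk_pacer_try(struct sk_pacer *p, uint64_t now_ms)
{
    assert(p != NULL);
    if (p->have_last && now_ms - p->last_ms < p->interval_ms)
        return 0;
    p->have_last = 1;
    p->last_ms = now_ms;
    return 1;
}
```

With a 3 second interval, a filter that matches every packet produces one upcall for the first packet and then at most one upcall per interval afterwards.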
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr sourcefilter 3 ,
.Xr altq 4 ,
.Xr dummynet 4 ,
.Xr gif 4 ,
.Xr gre 4 ,
.Xr icmp6 4 ,
.Xr igmp 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr mld 4 ,
.Xr pim 4
.\"
.Sh HISTORY
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first developed multicast routing protocol.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF)
and Core Based Trees (CBT), were developed as well.
Routers at autonomous system boundaries may now exchange multicast
routes with peers via the Border Gateway Protocol (BGP).
Many other routing protocols are able to redistribute multicast routes
for use with
.Dv PIM-SM
and
.Dv PIM-DM .
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa https://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
The IGMPv3 and MLDv2 multicast support was implemented by
.An Bruce Simpson .
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).