.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.Dd February 13, 2026
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
It is recommended that either
Protocol Independent Multicast - Sparse Mode (PIM-SM)
or Protocol Independent Multicast - Dense Mode (PIM-DM)
be used, as these are now the generally accepted protocols
in the Internet community.
The
.Sx HISTORY
section discusses previous multicast routing protocols.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
for the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the API described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require root privilege:
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for sending and receiving IGMP or MLD messages, respectively.
In the case of a
.Bx Ns
-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is opened, it can be used to enable
multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
           sizeof(filter));
.Ed
.Pp
When applied to the multicast routing socket, the
.Dv MRT_DONE
and
.Dv MRT6_DONE
socket options disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DONE, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DONE, (void *)&v, sizeof(v));
.Ed
.Pp
Closing the socket has the same effect.
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if PIM-SM or
PIM-DM is in use
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that is used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = 0;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
           sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Dv VIFF_TUNNEL
flag is no longer supported by
.Fx .
Users who wish to forward multicast datagrams over a tunnel should consider
configuring a
.Xr gif 4
or
.Xr gre 4
tunnel and using it as a physical interface.
.Pp
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it has a value of 1.
.Pp
The
.Va vifc_rate_limit
field is no longer supported in
.Fx
and should be set to 0.
Users who wish to rate-limit multicast datagrams should consider the use of
.Xr dummynet 4
or
.Xr altq 4 .
.Pp
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
           sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
           sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
           sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was set up
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with the field
.Va im_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with the field
.Va im6_mbz
set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Va im_msgtype
or
.Va im6_msgtype
with the type of the upcall,
.Dv IGMPMSG_*
or
.Dv MRT6MSG_* ,
for IPv4 and IPv6 respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall is a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
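.Pp
The header checks described above can be sketched as a small classifier
for data read from the IPv4 multicast routing socket.
This is only an illustration, not the kernel's or any daemon's code:
the function name and the UPCALL_* constants are hypothetical, and the
byte offsets are assumptions based on the statement that the upcall
header overlays struct ip with the protocol byte (offset 9) set to zero.
Real code should include netinet/ip_mroute.h and use struct igmpmsg
and the IGMPMSG_* constants directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: two unused 32-bit words, then im_msgtype and im_mbz. */
#define UPCALL_OFF_MSGTYPE 8
#define UPCALL_OFF_MBZ     9	/* same offset as ip_p in struct ip */
#define UPCALL_MIN_LEN     10

/*
 * Return the upcall type (the im_msgtype byte) if buf looks like a
 * kernel upcall, or -1 if it is too short or carries a non-zero
 * protocol byte (an ordinary packet such as IGMP).
 */
static int
classify_mrt_upcall(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);
	if (len < UPCALL_MIN_LEN)
		return (-1);
	if (buf[UPCALL_OFF_MBZ] != 0)
		return (-1);
	return (buf[UPCALL_OFF_MSGTYPE]);
}
```

A caller would pass the buffer returned by recvfrom(2) on the routing
socket and dispatch on the result, treating -1 as a regular IGMP packet
or a truncated read.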
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API)
while still allowing user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One mechanism that preserves backward
compatibility is a negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only the set of features
the kernel has agreed on.
.El
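.Pp
The three negotiation steps above reduce to simple bitset arithmetic.
The following sketch models them entirely in user space so the logic
can be followed without a kernel: the FEAT_* bits and both function
names are hypothetical stand-ins, and the "kernel" side is only a model
of step 2, not the actual kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; real code uses the names from the headers. */
#define FEAT_RP        (UINT32_C(1) << 8)
#define FEAT_BW_UPCALL (UINT32_C(1) << 9)

/* Step 2, modelled in user space: only known features are granted. */
static uint32_t
model_kernel_grant(uint32_t desired, uint32_t kernel_supported)
{
	return (desired & kernel_supported);
}

/* Step 3: bits that were requested but not granted must not be used. */
static uint32_t
features_missing(uint32_t desired, uint32_t granted)
{
	return (desired & ~granted);
}
```

A process would compute the missing set once after negotiating and fall
back to the basic API for every feature whose bit appears in it.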
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ;
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Va v
the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);	/* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The value returned in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To retrieve the set of enabled features later:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
Therefore,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif               */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address        */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls        */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC     (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;       /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;     /* multicast group associated*/
        vifi_t          mfcc_parent;       /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
        struct in_addr  mfcc_rp;            /* the RP address           */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif               */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
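.Pp
As an illustration of the rule above, a routing process might fill the
per-interface flags array as sketched below.
All names here are hypothetical: NVIFS stands in for the real table
size, FLAG_DISABLE_WRONGVIF mirrors the value of the flag listed above,
and the may_become_iif array is an assumed input that a daemon would
derive from its own protocol state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVIFS 4				/* hypothetical table size */
#define FLAG_DISABLE_WRONGVIF (1u << 0)	/* mirrors the flag value above */

/*
 * Suppress WRONGVIF signals on every interface that is neither in the
 * outgoing interface set (TTL threshold of zero) nor expected to
 * become an incoming interface.
 */
static void
fill_wrongvif_flags(uint8_t flags[NVIFS], const uint8_t oifs_ttl[NVIFS],
    const uint8_t may_become_iif[NVIFS])
{
	size_t i;

	assert(flags != NULL && oifs_ttl != NULL && may_become_iif != NULL);
	for (i = 0; i < NVIFS; i++) {
		flags[i] = 0;
		if (oifs_ttl[i] == 0 && !may_become_iif[i])
			flags[i] |= FLAG_DISABLE_WRONGVIF;
	}
}
```

The resulting array would then be copied into the flags field of the
MFC entry before it is installed.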
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address is taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet, for example, would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process needs to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it can initiate an (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a data flow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because in practice only those two are needed.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument).
.El
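.Pp
The secondary user-level filtering mentioned in the list above can be
sketched as follows.
The enum and function names are hypothetical and are not part of the
API; the sketch only shows the comparison a process would apply to the
measured value after the kernel has delivered an upcall for a
previously installed <= or >= filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Operations the kernel does not offer directly (hypothetical names). */
enum bw_op { BW_OP_LT, BW_OP_EQ, BW_OP_NE, BW_OP_GT };

/*
 * Second-stage check applied in user space to the measured value
 * reported by an upcall.
 */
static bool
secondary_filter(enum bw_op op, uint64_t measured, uint64_t value)
{
	switch (op) {
	case BW_OP_LT:		/* kernel filter: bw <= value */
		return (measured < value);
	case BW_OP_GT:		/* kernel filter: bw >= value */
		return (measured > value);
	case BW_OP_EQ:		/* kernel filter: either kind */
		return (measured == value);
	case BW_OP_NE:		/* kernel filter: bw <= 0xffffffff */
		return (measured != value);
	}
	return (false);
}
```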
.Pp
From an application's point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                        128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC  3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC 0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is >= threshold");
}
if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is <= threshold");
}
.Ed
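.Pp
The pseudo-algorithm above can be written as a small C predicate.
This is a user-space sketch for reasoning about filter behavior, not
the kernel implementation: the flag macros are local copies of the bit
values listed in the structure definition, and struct bw_sample, the
function name and the interval_elapsed parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the flag bit values listed in struct bw_upcall. */
#define UNIT_PACKETS (1u << 0)
#define UNIT_BYTES   (1u << 1)
#define OP_GEQ       (1u << 2)
#define OP_LEQ       (1u << 3)

struct bw_sample {
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Return true if an upcall would be sent. interval_elapsed says whether
 * the measured interval has reached the threshold interval; it only
 * matters for "<=" filters.
 */
static bool
bw_should_upcall(uint32_t flags, const struct bw_sample *measured,
    const struct bw_sample *threshold, bool interval_elapsed)
{
	bool by_packets = (flags & UNIT_PACKETS) != 0;
	bool by_bytes = (flags & UNIT_BYTES) != 0;

	assert(measured != NULL && threshold != NULL);
	if (flags & OP_GEQ)
		return ((by_packets &&
		    measured->packets >= threshold->packets) ||
		    (by_bytes && measured->bytes >= threshold->bytes));
	if ((flags & OP_LEQ) && interval_elapsed)
		return ((by_packets &&
		    measured->packets <= threshold->packets) ||
		    (by_bytes && measured->bytes <= threshold->bytes));
	return (false);
}
```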
.Pp
In the same
.Vt bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
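.Pp
The constraints stated above (a unit in packets and/or bytes, exactly
one of GEQ and LEQ, and an interval of at least 3 seconds) can be
checked in user space before a filter is installed.
The sketch below is an assumption-laden illustration: the function name
is hypothetical, the macros are local copies of the values listed
earlier, and it makes no claim about which checks the kernel performs
or which error it returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Local copies of the flag bits and minimum interval listed earlier. */
#define UNIT_PACKETS      (1u << 0)
#define UNIT_BYTES        (1u << 1)
#define OP_GEQ            (1u << 2)
#define OP_LEQ            (1u << 3)
#define MIN_INTERVAL_SEC  3
#define MIN_INTERVAL_USEC 0

static bool
bw_request_is_valid(uint32_t flags, const struct timeval *interval)
{
	uint32_t op = flags & (OP_GEQ | OP_LEQ);

	assert(interval != NULL);
	/* At least one unit must be selected. */
	if ((flags & (UNIT_PACKETS | UNIT_BYTES)) == 0)
		return (false);
	/* Exactly one of GEQ and LEQ: they are mutually exclusive. */
	if (op != OP_GEQ && op != OP_LEQ)
		return (false);
	/* The interval must not be shorter than the minimum. */
	if (interval->tv_sec < MIN_INTERVAL_SEC)
		return (false);
	if (interval->tv_sec == MIN_INTERVAL_SEC &&
	    interval->tv_usec < MIN_INTERVAL_USEC)
		return (false);
	return (true);
}
```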
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
           (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL
with the fields of
.Vt "struct bw_upcall"
set exactly as they were when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
set only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall" ,
and set only the
.Dv BW_UPCALL_DELETE_ALL
flag in the
.Va bw_upcall.bu_flags
field.
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx "Programming Guide"
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr sourcefilter 3 ,
.Xr altq 4 ,
.Xr dummynet 4 ,
.Xr gif 4 ,
.Xr gre 4 ,
.Xr icmp6 4 ,
.Xr igmp 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr mld 4 ,
.Xr pim 4
.\"
.Sh HISTORY
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first developed multicast routing protocol.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF)
and Core Based Trees (CBT) were developed as well.
Routers at autonomous system boundaries may now exchange multicast
routes with peers via the Border Gateway Protocol (BGP).
Many other routing protocols are able to redistribute multicast routes
for use with
.Dv PIM-SM
and
.Dv PIM-DM .
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pq Pa https://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
The IGMPv3 and MLDv2 multicast support was implemented by
.An Bruce Simpson .
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).