CDDL HEADER START

The contents of this file are subject to the terms of the
Common Development and Distribution License (the "License").
You may not use this file except in compliance with the License.

You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
or http://www.opensolaris.org/os/licensing.
See the License for the specific language governing permissions
and limitations under the License.

When distributing Covered Code, include this CDDL HEADER in each
file and include the License file at usr/src/OPENSOLARIS.LICENSE.
If applicable, add the following below this CDDL HEADER, with the
fields enclosed by brackets "[]" replaced with your own identifying
information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright 2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Architectural Overview for the DHCP agent
Peter Memishian

INTRODUCTION
============

The Solaris DHCP agent (dhcpagent) is a DHCP client implementation
compliant with RFCs 2131, 3315, and others. The major forces shaping
its design were:

    * Must be capable of managing multiple network interfaces.
    * Must consume little CPU, since it will always be running.
    * Must have a small memory footprint, since it will always be
      running.
    * Must not rely on any shared libraries outside of /lib, since
      it must run before all filesystems have been mounted.

When a DHCP agent implementation is only required to control a single
interface on a machine, the problem is expressed well as a simple
state-machine, as shown in RFC2131. However, when a DHCP agent is
responsible for managing more than one interface at a time, the
problem becomes much more complicated.

This can be resolved using threads or with an event-driven model.
Given that DHCP's behavior can be expressed concisely as a state
machine, the event-driven model is the closest match.

While tried-and-true, that model is subtle and easy to get wrong.
Indeed, much of the agent's code is there to manage the complexity of
programming in an asynchronous event-driven paradigm.

THE BASICS
==========

The DHCP agent consists of roughly 30 source files, most with a
companion header file. While the largest source file is around 1700
lines, most are much shorter. The source files can largely be broken
up into three groups:

    * Source files that, along with their companion header files,
      define an abstract "object" that is used by other parts of
      the system. Examples include "packet.c", which along with
      "packet.h" provides a Packet object for use by the rest of
      the agent; and "async.c", which along with "async.h" defines
      an interface for managing asynchronous transactions within
      the agent.

    * Source files that implement a given state of the agent; for
      instance, there is a "request.c" which comprises all of
      the procedural "work" which must be done while in the
      REQUESTING state of the agent. By encapsulating states in
      files, it becomes easier to debug errors in the
      client/server protocol and adapt the agent to new
      constraints, since all the relevant code is in one place.

    * Source files that, along with their companion header files,
      encapsulate a given task or related set of tasks. The
      difference between this and the first group is that the
      interfaces exported from these files do not operate on
      an "object", but rather perform a specific task. Examples
      include "defaults.c", which provides a useful interface
      to /etc/default/dhcpagent file operations.

OVERVIEW
========

Here we discuss the essential objects and subtle aspects of the
DHCP agent implementation. Note that there is of course much more
that is not discussed here, but after this overview you should be able
to fend for yourself in the source code.

For details on the DHCPv6 aspects of the design, and how this relates
to the implementation present in previous releases of Solaris, see the
README.v6 file.

Event Handlers and Timer Queues
-------------------------------

The most important object in the agent is the event handler, whose
interface is in libinetutil.h and whose implementation is in
libinetutil. The event handler is essentially an object-oriented
wrapper around poll(2): other components of the agent can register to
be called back when specific events on file descriptors happen -- for
instance, to wait for requests to arrive on its IPC socket, the agent
registers a callback function (accept_event()) that will be called
back whenever a new connection arrives on the file descriptor
associated with the IPC socket. When the agent initially begins in
main(), it registers a number of events with the event handler, and
then calls iu_handle_events(), which proceeds to wait for events to
happen -- this function does not return until the agent is shut down
via a signal.

When the registered events occur, the callback functions are called
back, which in turn might lead to additional callbacks being
registered -- this is the classic event-driven model. (As an aside,
note that programming in an event-driven model means that callbacks
cannot block, or else the agent will become unresponsive.)

A special kind of "event" is a timeout. Since there are many timers
which must be maintained for each DHCP-controlled interface (such as a
lease expiration timer, time-to-first-renewal (t1) timer, and so
forth), an object-oriented abstraction to timers called a "timer
queue" is provided, whose interface is in libinetutil.h with a
corresponding implementation in libinetutil. The timer queue allows
callback functions to be "scheduled" for callback after a certain
amount of time has passed.
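
To make the timer queue idea concrete, here is a minimal sketch of one
possible shape for such an object. It is not the libinetutil
implementation: the names `tq_t', `tq_schedule()', `tq_earliest_ms()',
`tq_expire()' and `count_cb()' are hypothetical, the queue is a small
fixed-size table, and the caller passes in the current time explicitly
so that the behavior is easy to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TQ_MAX 8                        /* hypothetical fixed capacity */

typedef void tq_callback_t(void *arg);

typedef struct {
    int64_t        expire_ns;           /* absolute expiry time */
    tq_callback_t *cb;
    void          *arg;                 /* opaque context for the callback */
    int            in_use;
} tq_timer_t;

typedef struct {
    tq_timer_t timers[TQ_MAX];
} tq_t;

/* Schedule cb(arg) for now_ns + delay_ns; returns a slot id, or -1 if full. */
int
tq_schedule(tq_t *tq, int64_t now_ns, int64_t delay_ns, tq_callback_t *cb,
    void *arg)
{
    assert(tq != NULL && cb != NULL && delay_ns >= 0);
    for (int i = 0; i < TQ_MAX; i++) {
        if (!tq->timers[i].in_use) {
            tq->timers[i] = (tq_timer_t){ now_ns + delay_ns, cb, arg, 1 };
            return (i);
        }
    }
    return (-1);
}

/*
 * Milliseconds until the next timer fires, rounded up, or -1 if no timer
 * is pending (-1 is also poll(2)'s "wait forever" timeout).  Very distant
 * timers would overflow the int; a real version would clamp the result.
 */
int
tq_earliest_ms(const tq_t *tq, int64_t now_ns)
{
    int64_t best = -1;

    for (int i = 0; i < TQ_MAX; i++) {
        int64_t left;

        if (!tq->timers[i].in_use)
            continue;
        left = tq->timers[i].expire_ns - now_ns;
        if (left < 0)
            left = 0;
        if (best == -1 || left < best)
            best = left;
    }
    return (best == -1 ? -1 : (int)((best + 999999) / 1000000));
}

/* Fire every timer whose expiry is at or before now_ns; returns the count. */
int
tq_expire(tq_t *tq, int64_t now_ns)
{
    int fired = 0;

    for (int i = 0; i < TQ_MAX; i++) {
        tq_timer_t *t = &tq->timers[i];

        if (t->in_use && t->expire_ns <= now_ns) {
            t->in_use = 0;              /* free the slot before calling back */
            t->cb(t->arg);
            fired++;
        }
    }
    return (fired);
}

/* Example callback: treats arg as an int counter and increments it. */
void
count_cb(void *arg)
{
    (*(int *)arg)++;
}
```

Note that tq_expire() may fire several timers in one call, which is
what happens when more than one timer shares the same expiry time.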

The event handler and timer queue objects work hand-in-hand: the event
handler is passed a pointer to a timer queue in iu_handle_events() --
from there, it can use the iu_earliest_timer() routine to find the
timer which will next fire, and use this to set its timeout value in
its call to poll(2). If poll(2) returns due to a timeout, the event
handler calls iu_expire_timers() to fire all timers that have expired
(note that more than one may have expired if, for example, multiple
timers were set to expire at the same time).

Although it is possible to instantiate more than one timer queue or
event handler object, it doesn't make a lot of sense -- these objects
are really "singletons". Accordingly, the agent has two global
variables, `eh' and `tq', which store pointers to the global event
handler and timer queue.

Network Interfaces
------------------

For each network interface managed by the agent, there is a set of
associated state that describes both its general properties (such as
the maximum MTU) and its connections to DHCP-related state (the
protocol state machines). This state is stored in a pair of
structures called `dhcp_pif_t' (the IP physical interface layer or
PIF) and `dhcp_lif_t' (the IP logical interface layer or LIF). Each
dhcp_pif_t represents a single physical interface, such as "hme0," for
a given IP protocol version (4 or 6), and has a list of dhcp_lif_t
structures representing the logical interfaces (such as "hme0:1") in
use by the agent.

This split is important because of differences between IPv4 and IPv6.
For IPv4, each DHCP state machine manages a single IP address and
associated configuration data. This corresponds to a single logical
interface, which must be specified by the user. For IPv6, however,
each DHCP state machine manages a group of addresses, and is
associated with a DUID value rather than with just an interface.

Thus, DHCPv6 behaves more like in.ndpd in its creation of "ADDRCONF"
interfaces. The agent automatically plumbs logical interfaces when
needed and removes them when the addresses expire.

The state for a given session is stored separately in `dhcp_smach_t'.
This state machine then points to the main LIF used for I/O, and to a
list of `dhcp_lease_t' structures representing individual leases, and
each of those points to a list of LIFs corresponding to the individual
addresses being managed.

One point that was brushed over in the preceding discussion of event
handlers and timer queues was context. Recall that the event-driven
nature of the agent requires that functions cannot block, lest they
starve out others and impact the observed responsiveness of the agent.
As an example, consider the process of extending a lease: the agent
must send a REQUEST packet and wait for an ACK or NAK packet in
response. This is done by sending a REQUEST and then returning to the
event handler that waits for an ACK or NAK packet to arrive on the
file descriptor associated with the interface. Note, however, that
when the ACK or NAK does arrive and the callback function is called
back, it must know which state machine this packet is for (it must get
back its context). This could be handled through an ad-hoc mapping of
file descriptors to state machines, but a cleaner approach is to have
the event handler's register function (iu_register_event()) take in an
opaque context pointer, which will then be passed back to the
callback. In the agent, the context pointer used depends on the
nature of the event: events on LIFs use the dhcp_lif_t pointer, events
on the state machine use dhcp_smach_t, and so on.
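
The opaque context pointer technique can be sketched in a few lines.
This is an illustration only, not the libinetutil API: `eh_t',
`eh_register()', `eh_dispatch()', `demo_smach_t' and
`demo_reply_event()' are all hypothetical names, and the dispatch
routine stands in for what a poll(2) loop would do once a descriptor
becomes ready.

```c
#include <assert.h>
#include <stddef.h>

#define EH_MAX 16                       /* hypothetical fixed table size */

typedef void eh_callback_t(int fd, void *arg);

typedef struct {
    int            fd;
    eh_callback_t *cb;
    void          *arg;                 /* opaque context, passed back untouched */
} eh_entry_t;

typedef struct {
    eh_entry_t entries[EH_MAX];
    int        count;
} eh_t;

/* Remember that cb(fd, arg) should run when fd is ready; -1 if table full. */
int
eh_register(eh_t *eh, int fd, eh_callback_t *cb, void *arg)
{
    assert(eh != NULL && cb != NULL);
    if (eh->count == EH_MAX)
        return (-1);
    eh->entries[eh->count++] = (eh_entry_t){ fd, cb, arg };
    return (0);
}

/* Run the callbacks registered for fd; returns how many were invoked. */
int
eh_dispatch(const eh_t *eh, int fd)
{
    int called = 0;

    for (int i = 0; i < eh->count; i++) {
        if (eh->entries[i].fd == fd) {
            eh->entries[i].cb(fd, eh->entries[i].arg);
            called++;
        }
    }
    return (called);
}

/* Hypothetical stand-in for a state machine (not the real dhcp_smach_t). */
typedef struct {
    const char *name;
    int         replies_seen;
} demo_smach_t;

/* The callback recovers its state machine from the context pointer. */
void
demo_reply_event(int fd, void *arg)
{
    demo_smach_t *sm = arg;

    (void) fd;
    sm->replies_seen++;
}
```

Because the handler never interprets `arg', the same registration call
works whether the context is a LIF, a state machine, or anything else;
no table mapping file descriptors to state machines is needed.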

Note that there is nothing that guarantees the pointer passed into
iu_register_event() or iu_schedule_timer() will still be valid when
the callback is called back (for instance, the memory may have been
freed in the meantime). To solve this problem, all of the data
structures used in this way are reference counted. For more details
on how the reference count scheme is implemented, see the closing
comments in interface.h regarding memory management.

Transactions
------------

Many operations performed via DHCP must be performed in groups -- for
instance, acquiring a lease requires several steps: sending a
DISCOVER, collecting OFFERs, selecting an OFFER, sending a REQUEST,
and receiving an ACK, assuming everything goes well. Note however
that due to the event-driven model the agent operates in, these
operations are not inherently "grouped" -- instead, the agent sends a
DISCOVER, goes back into the main event loop, waits for events
(perhaps even requests on the IPC channel to begin acquiring a lease
on another state machine), eventually checks to see if an acceptable
OFFER has come in, and so forth. To some degree, the notion of the
state machine's current state (SELECTING, REQUESTING, etc) helps
control the potential chaos of the event-driven model (for instance,
if while the agent is waiting for an OFFER on a given state machine,
an IPC event comes in requesting that the leases be RELEASED, the
agent knows to send back an error since the state machine must be in
at least the BOUND state before a RELEASE can be performed).

However, states are not enough -- for instance, suppose that the agent
begins trying to renew a lease. This is done by sending a REQUEST
packet and waiting for an ACK or NAK, which might never come.
If, while waiting for the ACK or NAK, the user sends a request to renew
the lease as well, then if the agent were to send another REQUEST,
things could get quite complicated (and this is only the beginning of
this rathole). To protect against this, two objects exist:
`async_action' and `ipc_action'. These objects are related, but
independent of one another; the more essential object is the
`async_action', which we will discuss first.

In short, an `async_action' represents a pending transaction (aka
asynchronous action), of which each state machine can have at most
one. The `async_action' structure is embedded in the `dhcp_smach_t'
structure, which is fine since there can be at most one pending
transaction per state machine. Typical "asynchronous transactions"
are START, EXTEND, and INFORM, since each consists of a sequence of
packets that must be done without interruption. Note that not all
DHCP operations are "asynchronous" -- for instance, a DHCPv4 RELEASE
operation is synchronous (not asynchronous) since after the RELEASE is
sent no reply is expected from the DHCP server, but DHCPv6 Release is
asynchronous, as all DHCPv6 messages are transactional. Some
operations, such as status query, are synchronous and do not affect
the system state, and thus do not require sequencing.

When the agent realizes it must perform an asynchronous transaction,
it calls async_start() to open the transaction. If one is already
pending, then the new transaction must fail (the details of failure
depend on how the transaction was initiated, which is described in
more detail later when the `ipc_action' object is discussed). If
there is no pending asynchronous transaction, the operation succeeds.

When the transaction is complete, either async_finish() or
async_cancel() must be called to complete or cancel the asynchronous
action on that state machine.
If the transaction is unable to
complete within a certain amount of time (more on this later), a timer
should be used to cancel the operation.

The notion of asynchronous transactions is complicated by the fact
that they may originate from both inside and outside of the agent.
For instance, a user initiates an asynchronous START transaction when
he performs an `ifconfig hme0 dhcp start', but the agent will
internally need to perform asynchronous EXTEND transactions to extend
the lease before it expires. Note that user-initiated actions always
have priority over internal actions: the former will cancel the
latter, if necessary.

This leads us into the `ipc_action' object. An `ipc_action'
represents the IPC-related pieces of an asynchronous transaction that
was started as a result of a user request, as well as the `BUSY' state
of the administrative interface. Only IPC-generated asynchronous
transactions have a valid `ipc_action' object. Note that since there
can be at most one asynchronous action per state machine, there can
also be at most one `ipc_action' per state machine (this means it can
also conveniently be embedded inside the `dhcp_smach_t' structure).

One of the main purposes of the `ipc_action' object is to time out user
requests. When the user specifies a timeout value as an argument to
ifconfig, he is specifying an `ipc_action' timeout; in other words,
how long he is willing to wait for the command to complete. When this
time expires, the ipc_action is terminated, as well as the
asynchronous operation.

The API provided for the `ipc_action' object is quite similar to the
one for the `async_action' object: when an IPC request comes in for an
operation requiring asynchronous operation, ipc_action_start() is
called. When the request completes, ipc_action_finish() is called.
If the user times out before the request completes, then
ipc_action_timeout() is called.

Packet Management
-----------------

Another complicated area is packet management: building, manipulating,
sending and receiving packets. These operations are all encapsulated
behind a dozen or so interfaces (see packet.h) that abstract the
unimportant details away from the rest of the agent code. In order to
send a DHCP packet, code first calls init_pkt(), which returns a
dhcp_pkt_t initialized suitably for transmission. Note that currently
init_pkt() returns a dhcp_pkt_t that is actually allocated as part of
the `dhcp_smach_t', but this may change in the future. After calling
init_pkt(), the add_pkt_opt*() functions are used to add options to
the DHCP packet. Finally, send_pkt() and send_pkt_v6() can be used to
transmit the packet to a given IP address.

The send_pkt() function handles the details of packet timeout and
retransmission. The last argument to send_pkt() is a pointer to a
"stop function." If this argument is passed as NULL, then the packet
will only be sent once (it won't be retransmitted). Otherwise, the
stop function is called back before each retransmission. The callback
may alter dsm_send_timeout if necessary to place a cap on the next
timeout; this is done for DHCPv6 in stop_init_reboot() in order to
implement the CNF_MAX_RD constraint.

The return value from this function indicates whether to continue
retransmission or not, which allows the send_pkt() caller to control
the retransmission policy without making it have to deal with the
retransmission mechanism. See request.c for an example of this in
action.

The recv_pkt() function is simpler but still complicated by the fact
that one may want to receive several different types of packets at
once.
The caller registers an event handler on the file descriptor,
and then calls recv_pkt() to read in the packet along with meta
information about the message (the sender and interface identifier).

For IPv6, packet reception is done with a single socket, using
IPV6_PKTINFO to determine the actual destination address and receiving
interface. Packets are then matched against the state machines on the
given interface through the transaction ID.

For IPv4, due to oddities in the DHCP specification (discussed in
PSARC/2007/571), a special IP_DHCPINIT_IF socket option must be used
to allow unicast DHCP traffic to be received on an interface during
lease acquisition. Since the IP_DHCPINIT_IF socket option can only
enable one interface at a time, one socket must be used per interface.

Time
----

The notion of time is an exceptionally subtle area. You will notice
five ways that time is represented in the source: as lease_t's,
uint32_t's, time_t's, hrtime_t's, and monosec_t's. Each of these
types serves a slightly different function.

The `lease_t' type is the simplest to understand; it is the unit of
time in the CD_{LEASE,T1,T2}_TIME options in a DHCP packet, as defined
by RFC2131. This is defined as a positive number of seconds (relative
to some fixed point in time) or the value `-1' (DHCP_PERM) which
represents infinity (i.e., a permanent lease). The lease_t should be
used either when dealing with actual DHCP packets that are sent on the
wire or for variables which follow the exact definition given in the
RFC.

The `uint32_t' type is also used to represent a relative time in
seconds. However, here the value `-1' is not special and of course
this type is not tied to any definition given in RFC2131. Use this
for representing "offsets" from another point in time that are not
DHCP lease times.

The `time_t' type is the natural Unix type for representing time since
the epoch. Unfortunately, it is affected by stime(2) or adjtime(2)
and since the DHCP client is used during system installation (and thus
when time is typically being configured), the time_t cannot be used in
general to represent an absolute time since the epoch. For instance,
if a time_t were used to keep track of when a lease began, and then a
minute later stime(2) was called to adjust the system clock forward a
year, then the lease would appear to have expired a year ago even
though it has only been a minute. For this reason, time_t's should
only be used either when wall time must be displayed (such as in a
DHCP_STATUS IPC transaction) or when a time meaningful across reboots
must be obtained (such as when caching an ACK packet at system
shutdown).

The `hrtime_t' type returned from gethrtime() works around the
limitations of the time_t in that it is not affected by stime(2) or
adjtime(2), with the disadvantage that it represents time from some
arbitrary time in the past and in nanoseconds. The timer queue code
deals with hrtime_t's directly since that particular piece of code is
meant to be fairly independent of the rest of the DHCP client.

However, dealing with nanoseconds is error-prone when all the other
time types are in seconds. As a result, yet another time type, the
`monosec_t' was created to represent a monotonically increasing time
in seconds, and is really no more than (hrtime_t / NANOSEC). Note
that this unit is typically used where time_t's would've traditionally
been used. The function monosec() in util.c returns the current
monosec, and monosec_to_time() can convert a given monosec to wall
time, using the system's current notion of time.
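
The relationship between these types can be sketched as follows. This
is not the code in util.c: it uses the POSIX
clock_gettime(CLOCK_MONOTONIC) call as a portable stand-in for
gethrtime(), and the names `mono_now_ns()', `mono_now_sec()' and
`mono_to_wall()' are hypothetical. The conversion shown is one
plausible way to map a monotonic second count onto wall time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L         /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000LL

typedef int64_t mono_ns_t;              /* stand-in for hrtime_t */
typedef int64_t mono_sec_t;             /* stand-in for monosec_t */

/* Nanoseconds since an arbitrary point in the past; unaffected by stime(2). */
mono_ns_t
mono_now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return ((mono_ns_t)ts.tv_sec * NS_PER_SEC + ts.tv_nsec);
}

/* The same clock in whole seconds: simply nanoseconds / NS_PER_SEC. */
mono_sec_t
mono_now_sec(void)
{
    return (mono_now_ns() / NS_PER_SEC);
}

/*
 * Convert a monotonic second count to wall time by applying its distance
 * from "now" on the monotonic clock to "now" on the wall clock.  The two
 * current times are parameters so the arithmetic is easy to check.
 */
time_t
mono_to_wall(mono_sec_t when, mono_sec_t mono_now, time_t wall_now)
{
    return (wall_now + (time_t)(when - mono_now));
}
```

A caller would typically invoke mono_to_wall(when, mono_now_sec(),
time(NULL)) at the moment the wall time is needed, so the result always
reflects the system's current notion of time.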

One additional limitation of the `hrtime_t' and `monosec_t' types is
that they are unaware of the passage of time across checkpoint/resume
events (e.g., those generated by sys-suspend(8)). For example, if
gethrtime() returns time T, and then the machine is suspended for 2
hours, and then gethrtime() is called again, the time returned is not
T + (2 * 60 * 60 * NANOSEC), but rather approximately still T.

To work around this (and other checkpoint/resume related problems),
when a system is resumed, the DHCP client makes the pessimistic
assumption that all finite leases have expired while the machine was
suspended and must be obtained again. This is known as "refreshing"
the leases, and is handled by refresh_smachs().

It might appear that a more intelligent approach would be to
record the time(2) when the system is suspended, compare that against
the time(2) when the system is resumed, and use the delta between them
to decide which leases have expired. Sadly, this cannot be done since
through at least Solaris 10, it is not possible for userland programs
to be notified of system suspend events.

Configuration
-------------

For the most part, the DHCP client only *retrieves* configuration data
from the DHCP server, leaving the configuration to scripts (such as
boot scripts), which themselves use dhcpinfo(1) to retrieve the data
from the DHCP client. This is desirable because it keeps the mechanism
of retrieving the configuration data decoupled from the policy of using
the data.

However, unless used in "inform" mode, the DHCP client *does*
configure each IP interface enough to allow it to communicate with
other hosts. Specifically, the DHCP client configures the interface's
IP address, netmask, and broadcast address using the information
provided by the server.
Further, for IPv4 logical interface 0
("hme0"), any provided default routes are also configured.

For IPv6, only the IP addresses are set. The netmask (prefix) is then
set automatically by in.ndpd, and routes are discovered in the usual
way by router discovery or routing protocols. DHCPv6 doesn't set
routes.

Since logical interfaces cannot be specified as output interfaces in
the kernel forwarding table, and in most cases, logical interfaces
share a default route with their associated physical interface, the
DHCP client does not automatically add or remove default routes when
IPv4 leases are acquired or expired on logical interfaces.

Event Scripting
---------------

The DHCP client supports user program invocations on DHCP events. The
supported events are BOUND, EXTEND, EXPIRE, DROP, RELEASE, and INFORM
for DHCPv4, and BOUND6, EXTEND6, EXPIRE6, DROP6, LOSS6, RELEASE6, and
INFORM6 for DHCPv6. The user program runs asynchronously to the DHCP
client so that the main event loop stays active to process other
events, including events triggered by the user program (for example,
when it invokes dhcpinfo).

The user program execution is part of the transaction of a DHCP command.
For example, if the user program is not enabled, the transaction of the
DHCP command START is considered over when an ACK is received and the
interface is configured successfully. If the user program is enabled,
it is invoked after the interface is configured successfully, and the
transaction is considered over only when the user program exits. The
event scripting implementation makes use of the asynchronous operations
discussed in the "Transactions" section.

An upper bound of 58 seconds is imposed on how long the user program
can run. If the user program does not exit after 55 seconds, the signal
SIGTERM is sent to it.
If it still does not exit after an additional 3
seconds, the signal SIGKILL is sent to it. Since the event handler is
a wrapper around poll(), the DHCP client cannot directly observe the
completion of the user program. Instead, the DHCP client creates a
child "helper" process to synchronously monitor the user program (this
process is also used to send the aforementioned signals to the process,
if necessary). The DHCP client and the helper process share a pipe
which is included in the set of poll descriptors monitored by the DHCP
client's event handler. When the user program exits, the helper process
passes the user program exit status to the DHCP client through the pipe,
informing the DHCP client that the user program has finished. When the
DHCP client is asked to shut down, it will wait for any running instances
of the user program to complete.
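
Two pieces of this scheme are easy to sketch in isolation: the signal
escalation schedule and the client's side of the pipe. The code below
is an approximation under stated assumptions, not the agent's
implementation. The names `script_signal_for()' and
`script_poll_status()' are hypothetical, the exit status is assumed to
travel over the pipe as a single int, and a direct poll(2) call stands
in for the event handler that would normally watch the descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

#define SCRIPT_TERM_SECS 55     /* from the text: SIGTERM after 55 seconds */
#define SCRIPT_KILL_SECS 3      /* from the text: SIGKILL 3 seconds later */

/*
 * Signal the helper should have escalated to once the user program has
 * been running for `elapsed' seconds; 0 means no signal is due yet.
 */
int
script_signal_for(unsigned int elapsed)
{
    if (elapsed >= SCRIPT_TERM_SECS + SCRIPT_KILL_SECS)
        return (SIGKILL);
    if (elapsed >= SCRIPT_TERM_SECS)
        return (SIGTERM);
    return (0);
}

/*
 * Client side: check the pipe shared with the helper process.  Returns 1
 * and fills in *status once the helper has reported, 0 if nothing arrived
 * within timeout_ms, and -1 on error or if the helper closed the pipe
 * without reporting.
 */
int
script_poll_status(int pipe_fd, int timeout_ms, int *status)
{
    struct pollfd pfd = { .fd = pipe_fd, .events = POLLIN };
    int n;

    assert(status != NULL);
    n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return (n);             /* 0: timed out; -1: poll(2) failed */
    if (read(pipe_fd, status, sizeof (*status)) != (ssize_t)sizeof (*status))
        return (-1);
    return (1);
}
```

The helper process itself can block in waitpid(2) without harm, since
it is the only process that needs to wait synchronously; the client
merely sees the pipe become readable like any other descriptor.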