CDDL HEADER START

The contents of this file are subject to the terms of the
Common Development and Distribution License, Version 1.0 only
(the "License"). You may not use this file except in compliance
with the License.

You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
or http://www.opensolaris.org/os/licensing.
See the License for the specific language governing permissions
and limitations under the License.

When distributing Covered Code, include this CDDL HEADER in each
file and include the License file at usr/src/OPENSOLARIS.LICENSE.
If applicable, add the following below this CDDL HEADER, with the
fields enclosed by brackets "[]" replaced with your own identifying
information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright 2004 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Architectural Overview for the DHCP agent
Peter Memishian
ident "%Z%%M% %I% %E% SMI"

INTRODUCTION
============

The Solaris DHCP agent (dhcpagent) is an RFC2131-compliant DHCP client
implementation. The major forces shaping its design were:

    * Must be capable of managing multiple network interfaces.
    * Must consume little CPU, since it will always be running.
    * Must have a small memory footprint, since it will always be
      running.
    * Must not rely on any shared libraries, since it must run
      before all filesystems have been mounted.

When a DHCP agent implementation is only required to control a single
interface on a machine, the problem is expressed well as a simple
state-machine, as shown in RFC2131. However, when a DHCP agent is
responsible for managing more than one interface at a time, the
problem becomes much more complicated, especially when threads cannot
be used to attack the problem (since the filesystems containing the
thread libraries may not be available when the agent starts).
Instead, the problem must be solved using an event-driven model, which,
while tried-and-true, is subtle and easy to get wrong. Indeed, much
of the agent's code is there to manage the complexity of programming
in an asynchronous event-driven paradigm.

THE BASICS
==========

The DHCP agent consists of roughly 20 source files, most with a
companion header file. While the largest source file is around 700
lines, most are much shorter. The source files can largely be broken
up into three groups:

    * Source files, which along with their companion header files,
      define an abstract "object" that is used by other parts of
      the system. Examples include "timer_queue.c", which along
      with "timer_queue.h" provides a Timer Queue object for use
      by the rest of the agent, and "async.c", which along with
      "async.h" defines an interface for managing asynchronous
      transactions within the agent.

    * Source files which implement a given state of the agent; for
      instance, there is a "request.c" which comprises all of
      the procedural "work" which must be done while in the
      REQUESTING state of the agent. By encapsulating states in
      files, it becomes easier to debug errors in the
      client/server protocol and adapt the agent to new
      constraints, since all the relevant code is in one place.

    * Source files, which along with their companion header files,
      encapsulate a given task or related set of tasks. The
      difference between this and the first group is that the
      interfaces exported from these files do not operate on
      an "object", but rather perform a specific task. Examples
      include "dlpi_io.c", which provides a useful interface
      to DLPI-related I/O operations.

OVERVIEW
========

Here we discuss the essential objects and subtle aspects of the
DHCP agent implementation.
Note that there is of course much more
that is not discussed here, but after this overview you should be able
to fend for yourself in the source code.

Event Handlers and Timer Queues
-------------------------------

The most important object in the agent is the event handler, whose
interface is in libinetutil.h and whose implementation is in
libinetutil. The event handler is essentially an object-oriented
wrapper around poll(2): other components of the agent can register to
be called back when specific events on file descriptors happen -- for
instance, to wait for requests to arrive on its IPC socket, the agent
registers a callback function (accept_event()) that will be called
back whenever a new connection arrives on the file descriptor
associated with the IPC socket. When the agent initially begins in
main(), it registers a number of events with the event handler, and
then calls iu_handle_events(), which proceeds to wait for events to
happen -- this function does not return until the agent is shut down
via a signal.

When the registered events occur, the corresponding callback functions
are called back, which in turn might lead to additional callbacks being
registered -- this is the classic event-driven model. (As an aside,
note that programming in an event-driven model means that callbacks
cannot block, or else the agent will become unresponsive.)

A special kind of "event" is a timeout. Since there are many timers
which must be maintained for each DHCP-controlled interface (such as a
lease expiration timer, time-to-first-renewal (t1) timer, and so
forth), an object-oriented abstraction of timers called a "timer
queue" is provided, whose interface is in libinetutil.h with a
corresponding implementation in libinetutil. The timer queue allows
callback functions to be "scheduled" for callback after a certain
amount of time has passed.
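
The register-then-dispatch pattern described above can be sketched in
a few dozen lines of C. This is only an illustration and should not be
read as the libinetutil implementation or its interfaces: the mini_*
names, the fixed-size table, and the single-pass dispatch function are
all assumptions made for the sketch, and timers are left out entirely.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define	MINI_MAX	8	/* arbitrary table size for this sketch */

/* Hypothetical callback type: the ready descriptor plus opaque context. */
typedef void mini_event_cb_t(int fd, void *ctx);

typedef struct {
	struct pollfd	fds[MINI_MAX];
	mini_event_cb_t	*cbs[MINI_MAX];
	void		*ctxs[MINI_MAX];
	nfds_t		nfds;
} mini_eh_t;

/* Ask to be called back with `ctx' when `events' occur on `fd'. */
static int
mini_register(mini_eh_t *eh, int fd, short events, mini_event_cb_t *cb,
    void *ctx)
{
	if (eh->nfds == MINI_MAX || cb == NULL)
		return (-1);
	eh->fds[eh->nfds].fd = fd;
	eh->fds[eh->nfds].events = events;
	eh->fds[eh->nfds].revents = 0;
	eh->cbs[eh->nfds] = cb;
	eh->ctxs[eh->nfds] = ctx;
	eh->nfds++;
	return (0);
}

/*
 * One pass of the event loop: wait in poll(2) for up to `timeout_ms',
 * then call back for each ready descriptor.  Returns the number of
 * callbacks made, or -1 if poll(2) failed.
 */
static int
mini_dispatch_once(mini_eh_t *eh, int timeout_ms)
{
	int	called = 0;

	if (poll(eh->fds, eh->nfds, timeout_ms) == -1)
		return (-1);
	for (nfds_t i = 0; i < eh->nfds; i++) {
		if (eh->fds[i].revents & eh->fds[i].events) {
			assert(eh->cbs[i] != NULL);
			eh->cbs[i](eh->fds[i].fd, eh->ctxs[i]);
			called++;
		}
	}
	return (called);
}

/* Demo callback: drain one byte and bump the counter passed as context. */
static void
count_event(int fd, void *ctx)
{
	char	c;

	if (read(fd, &c, 1) == 1)
		(*(int *)ctx)++;
}

/* Returns how many times the callback ran, or -1 on any failure. */
int
mini_demo(void)
{
	mini_eh_t	eh = { 0 };
	int		fds[2], nevents = 0;

	if (pipe(fds) == -1)
		return (-1);
	if (mini_register(&eh, fds[0], POLLIN, count_event, &nevents) == -1 ||
	    write(fds[1], "x", 1) != 1 ||
	    mini_dispatch_once(&eh, 0) != 1)
		nevents = -1;
	(void) close(fds[0]);
	(void) close(fds[1]);
	return (nevents);
}
```

The callback in this sketch never blocks: it reads only after poll(2)
has reported the descriptor readable, which is the discipline that an
event-driven agent depends on.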

The event handler and timer queue objects work hand-in-hand: the event
handler is passed a pointer to a timer queue in iu_handle_events() --
from there, it can use the iu_earliest_timer() routine to find the
timer which will next fire, and use this to set its timeout value in
its call to poll(2). If poll(2) returns due to a timeout, the event
handler calls iu_expire_timers() to call back all timers that have
expired (note that more than one may have expired if, for example,
multiple timers were set to expire at the same time).

Although it is possible to instantiate more than one timer queue or
event handler object, it doesn't make a lot of sense -- these objects
are really "singletons". Accordingly, the agent has two global
variables, `eh' and `tq', which store pointers to the global event
handler and timer queue.

Network Interfaces
------------------

For each network interface managed by the agent, there is a set of
associated state that describes both its general properties (such as
the maximum MTU) and its DHCP-related state (such as when it acquired
a lease). This state is stored in a structure called an `ifslist',
which is a poor name (since it suggests implementation artifacts but
not purpose) but has historical precedent. Another way to think about
an `ifslist' is that it provides all of the context necessary to
perform DHCP on a given interface: the state the interface is in, the
last DHCP packet received on that interface, and so forth. As
one can imagine, the `ifslist' structure is quite complicated and the
rules governing access to its fields are equally convoluted -- see the
comments in interface.h for more information.

One point that was glossed over in the preceding discussion of event
handlers and timer queues was context.
Recall that the event-driven
nature of the agent requires that functions cannot block, lest they
starve out others and impact the observed responsiveness of the agent.
As an example, consider the process of extending a lease: the agent
must send a REQUEST packet and wait for an ACK or NAK packet in
response. This is done by sending a REQUEST and then registering a
callback with the event handler that waits for an ACK or NAK packet to
arrive on the file descriptor associated with the interface. Note,
however, that when the ACK or NAK does arrive and the callback
function is called back, it must know which interface this packet is
for (it must get back its context). This could be handled through an
ad-hoc mapping of file descriptors to interfaces, but a cleaner
approach is to have the event handler's register function
(iu_register_event()) take in an opaque context pointer, which will
then be passed back to the callback. In the agent, this context
pointer is always the `ifslist', but for reasons of decoupling and
generality, the timer queue and event handler objects allow a generic
(void *) context argument.

Note that there is nothing that guarantees the pointer passed into
iu_register_event() or iu_schedule_timer() will still be valid when
the callback is called back (for instance, the memory may have been
freed in the meantime). To solve this problem, ifslists are reference
counted. For more details on how the reference count scheme is
implemented, see the closing comments in interface.h regarding memory
management.

Transactions
------------

Many operations performed via DHCP must be performed in groups -- for
instance, acquiring a lease requires several steps: sending a
DISCOVER, collecting OFFERs, selecting an OFFER, sending a REQUEST,
and receiving an ACK, assuming everything goes well.
Note, however,
that due to the event-driven model the agent operates in, these
operations are not inherently "grouped" -- instead, the agent sends a
DISCOVER, goes back into the main event loop, waits for events
(perhaps even requests on the IPC channel to begin acquiring a lease
on another interface), eventually checks to see if an acceptable OFFER
has come in, and so forth. To some degree, the notion of the current
state of an interface (SELECTING, REQUESTING, etc.) helps control the
potential chaos of the event-driven model (for instance, if while the
agent is waiting for an OFFER on a given interface, an IPC event comes
in requesting that the interface be RELEASED, the agent knows to send
back an error since the interface must be in at least the BOUND state
before a RELEASE can be performed).

However, states are not enough -- for instance, suppose that the agent
begins trying to renew a lease -- this is done by sending a REQUEST
packet and waiting for an ACK or NAK, which might never come. If,
while waiting for the ACK or NAK, the user sends a request to renew
the lease as well, then if the agent were to send another REQUEST,
things could get quite complicated (and this is only the beginning of
this rathole). To protect against this, two objects exist:
`async_action' and `ipc_action'. These objects are related, but
independent of one another; the more essential object is the
`async_action', which we will discuss first.

In short, an `async_action' represents a pending transaction (aka
asynchronous action), of which each interface can have at most one.
The `async_action' structure is embedded in the `ifslist' structure,
which is fine since there can be at most one pending transaction per
interface. Typical "asynchronous transactions" are START, EXTEND, and
INFORM, since each consists of a sequence of packets that must be done
without interruption.
Note that not all DHCP operations are
"asynchronous" -- for instance, a RELEASE operation is synchronous
(not asynchronous) since after the RELEASE is sent no reply is
expected from the DHCP server. Also, note that there can be
synchronous operations intermixed with asynchronous operations,
although it's not recommended.

When the agent realizes it must perform an asynchronous transaction,
it first calls async_pending() to see if there is already one pending;
if so, the new transaction must fail (the details of failure depend on
how the transaction was initiated, which is described in more detail
later when the `ipc_action' object is discussed). If there is no
pending asynchronous transaction, async_start() is called to begin
one.

When the transaction is complete, async_finish() must be called to
complete the asynchronous action on that interface. If the
transaction is unable to complete within a certain amount of time
(more on this later), async_timeout() is invoked, which attempts to
cancel the asynchronous action with async_cancel(). If the action is
not cancellable it is left pending, although this means that no future
asynchronous actions can be performed on the interface until the
transaction voluntarily calls async_finish(). While this may seem
suboptimal, cancellation here is quite analogous to thread
cancellation, which is generally considered a difficult problem.

The notion of asynchronous transactions is complicated by the fact
that they may originate from both inside and outside of the agent.
For instance, a user initiates an asynchronous START transaction by
running `ifconfig hme0 dhcp start', but the agent will
internally need to perform asynchronous EXTEND transactions to extend
the lease before it expires. This leads us into the `ipc_action'
object.
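
The one-pending-transaction rule above can be made concrete with a
small C sketch. It is not the code in async.c: the sk_* names, the
types, and the function signatures are hypothetical and only mirror
the roles of async_pending(), async_start(), async_finish() and
async_cancel() as described in this section. Timers are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transaction types; SK_NONE means "nothing pending". */
typedef enum { SK_NONE, SK_START, SK_EXTEND, SK_INFORM } sk_type_t;

typedef struct {
	sk_type_t	type;
	bool		cancellable;
} sk_async_t;

typedef struct {
	const char	*name;
	sk_async_t	async;		/* embedded: at most one per interface */
} sk_if_t;

static bool
sk_async_pending(const sk_if_t *ifp)
{
	return (ifp->async.type != SK_NONE);
}

/* Begin a transaction; refuse if one is already pending. */
static bool
sk_async_start(sk_if_t *ifp, sk_type_t type, bool cancellable)
{
	if (type == SK_NONE || sk_async_pending(ifp))
		return (false);
	ifp->async.type = type;
	ifp->async.cancellable = cancellable;
	return (true);
}

/* Complete the pending transaction, freeing the interface for the next. */
static void
sk_async_finish(sk_if_t *ifp)
{
	assert(sk_async_pending(ifp));
	ifp->async.type = SK_NONE;
}

/* Try to cancel; a non-cancellable transaction is left pending. */
static bool
sk_async_cancel(sk_if_t *ifp)
{
	if (!sk_async_pending(ifp) || !ifp->async.cancellable)
		return (false);
	sk_async_finish(ifp);
	return (true);
}

/* Walk through the cases above; each expected outcome sets one bit. */
int
sk_async_demo(void)
{
	sk_if_t	ifs = { "hme0", { SK_NONE, false } };
	int	ok = 0;

	if (sk_async_start(&ifs, SK_EXTEND, false))	/* internal EXTEND */
		ok |= 1;
	if (!sk_async_start(&ifs, SK_START, true))	/* refused: one pending */
		ok |= 2;
	if (!sk_async_cancel(&ifs))			/* not cancellable */
		ok |= 4;
	sk_async_finish(&ifs);				/* ACK or NAK arrived */
	if (sk_async_start(&ifs, SK_START, true) && sk_async_cancel(&ifs))
		ok |= 8;
	return (ok);
}
```

Embedding the record in the per-interface structure, as the sketch
does, means no allocation is needed to start a transaction and the
"at most one" rule is enforced by construction.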

An `ipc_action' represents the IPC-related pieces of an asynchronous
transaction that was started as a result of a user request. Only
IPC-generated asynchronous transactions have a valid `ipc_action'
object. Note that since there can be at most one asynchronous action
per interface, there can also be at most one `ipc_action' per
interface (this means it can also conveniently be embedded inside the
`ifslist' structure).

One of the main purposes of the `ipc_action' object is to time out user
events. This is not the same as timing out the transaction; for
instance, when the user specifies a timeout value as an argument to
ifconfig, they are specifying an `ipc_action' timeout; in other words,
how long they are willing to wait for the command to complete. However,
even after the command times out for the user, the asynchronous
transaction continues until async_timeout() occurs.

It is worth understanding these timeouts since the relationship is
subtle but powerful. The `async_action' timer specifies how long the
agent will try to perform the transaction; the `ipc_action' timer
specifies how long the user is willing to wait for the action to
complete. If, when the `async_action' timer fires and async_timeout()
is called, there is no associated `ipc_action' (either because the
transaction was not initiated by a user or because the user already
timed out), then async_cancel() proceeds as described previously. If,
on the other hand, the user is still waiting for the transaction to
complete, then async_timeout() is rescheduled and the transaction is
left pending. While this behavior might seem odd, it adheres to the
principle of least surprise: when a user is willing to wait for a
transaction to complete, the agent should try for as long as they're
willing to wait.
On the other hand, if the agent were to take that
stance with its internal transactions, it would block out
user-requested operations if the internal transaction never completed
(perhaps because the server never sent an ACK in response to our lease
extension REQUEST).

The API provided for the `ipc_action' object is quite similar to the
one for the `async_action' object: when an IPC request comes in for an
operation requiring asynchronous operation, ipc_action_start() is
called. When the request completes, ipc_action_finish() is called.
If the user times out before the request completes, then
ipc_action_timeout() is called.

Packet Management
-----------------

Another complicated area is packet management: building, manipulating,
sending and receiving packets. These operations are all encapsulated
behind a dozen or so interfaces (see packet.h) that abstract the
unimportant details away from the rest of the agent code. In order to
send a DHCP packet, code first calls init_pkt(), which returns a
dhcp_pkt_t initialized suitably for transmission. Note that currently
init_pkt() returns a dhcp_pkt_t that is actually allocated as part of
the `ifslist', but this may change in the future. After calling
init_pkt(), the add_pkt_opt*() functions are used to add options to
the DHCP packet. Finally, send_pkt() can be used to transmit the
packet to a given IP address.

The send_pkt() function is actually quite complicated; for one, it
must internally use either DLPI or sockets depending on the state of
the interface; for two, it handles the details of packet timeout and
retransmission. The last argument to send_pkt() is a pointer to a
"stop function". If this argument is passed as NULL, then the packet
will only be sent once (it won't be retransmitted). Otherwise, the
stop function will be called back before each retransmission.
The return value from this function indicates whether
to continue retransmission or not, which allows the send_pkt() caller
to control the retransmission policy without having to deal
with the retransmission mechanism. See init_reboot.c for an example
of this in action.

The recv_pkt() function is simpler but still complicated by the fact
that one may want to receive several different types of packets at
once; for instance, after sending a REQUEST, either an ACK or a NAK is
acceptable. Also, before calling recv_pkt(), the caller must know
that there is data to be read from the socket (this can be
accomplished by using the event handler), otherwise recv_pkt() will
block, which is clearly not acceptable.

Time
----

The notion of time is an exceptionally subtle area. You will notice
five ways that time is represented in the source: as lease_t's,
uint32_t's, time_t's, hrtime_t's, and monosec_t's. Each of these
types serves a slightly different function.

The `lease_t' type is the simplest to understand; it is the unit of
time in the CD_{LEASE,T1,T2}_TIME options in a DHCP packet, as defined
by RFC2131. This is defined as a positive number of seconds (relative
to some fixed point in time) or the value `-1' (DHCP_PERM) which
represents infinity (i.e., a permanent lease). The lease_t should be
used either when dealing with actual DHCP packets that are sent on the
wire or for variables which follow the exact definition given in the
RFC.

The `uint32_t' type is also used to represent a relative time in
seconds. However, here the value `-1' is not special and of course
this type is not tied to any definition given in RFC2131. Use this
for representing "offsets" from another point in time that are not
DHCP lease times.

The `time_t' type is the natural Unix type for representing time since
the epoch.
Unfortunately, it is affected by stime(2) or adjtime(2)
and since the DHCP client is used during system installation (and thus
when time is typically being configured), the time_t cannot be used in
general to represent an absolute time since the epoch. For instance,
if a time_t were used to keep track of when a lease began, and then a
minute later stime(2) was called to adjust the system clock forward a
year, then the lease would appear to have expired a year ago even
though it has only been a minute. For this reason, time_t's should
only be used either when wall time must be displayed (such as in a
DHCP_STATUS IPC transaction) or when a time meaningful across reboots
must be obtained (such as when caching an ACK packet at system
shutdown).

The `hrtime_t' type returned from gethrtime() works around the
limitations of the time_t in that it is not affected by stime(2) or
adjtime(2), with the disadvantage that it represents time from some
arbitrary time in the past and in nanoseconds. The timer queue code
deals with hrtime_t's directly since that particular piece of code is
meant to be fairly independent of the rest of the DHCP client.

However, dealing with nanoseconds is error-prone when all the other
time types are in seconds. As a result, yet another time type, the
`monosec_t', was created to represent a monotonically increasing time
in seconds, and is really no more than (hrtime_t / NANOSEC). Note
that this unit is typically used where time_t's would've traditionally
been used. The function monosec() in util.c returns the current
monosec, and monosec_to_time() can convert a given monosec to wall
time, using the system's current notion of time.

One additional limitation of the `hrtime_t' and `monosec_t' types is
that they are unaware of the passage of time across checkpoint/resume
events (e.g., those generated by sys-suspend(1M)).
For example, if
gethrtime() returns time T, and then the machine is suspended for 2
hours, and then gethrtime() is called again, the time returned is not
T + (2 * 60 * 60 * NANOSEC), but rather still approximately T.

To work around this (and other checkpoint/resume related problems),
when a system is resumed, the DHCP client makes the pessimistic
assumption that all finite leases have expired while the machine was
suspended and must be obtained again. This is known as "refreshing"
the leases, and is handled by refresh_ifslist().

Note that a more intelligent approach might seem to be to
record the time(2) when the system is suspended, compare that against
the time(2) when the system is resumed, and use the delta between them
to decide which leases have expired. Sadly, this cannot be done since
through at least Solaris 8, it is not possible for userland programs
to be notified of system suspend events.

Configuration
-------------

For the most part, the DHCP client only *retrieves* configuration data
from the DHCP server, leaving the configuration to scripts (such as
boot scripts), which themselves use dhcpinfo(1) to retrieve the data
from the DHCP client. This is desirable because it keeps the mechanism
of retrieving the configuration data decoupled from the policy of using
the data.

However, unless used in "inform" mode, the DHCP client *does* configure
each interface enough to allow it to communicate with other hosts.
Specifically, the DHCP client configures the interface's IP address,
netmask, and broadcast address using the information provided by the
server. Further, for physical interfaces, any provided default routes
are also configured.
Since logical interfaces cannot be stored in the
kernel routing table, and in most cases, logical interfaces share a
default route with their associated physical interface, the DHCP client
does not automatically add or remove default routes when leases are
acquired or expired on logical interfaces.

Event Scripting
---------------

The DHCP client supports user program invocations on DHCP events. The
supported events are BOUND, EXTEND, EXPIRE, DROP and RELEASE. The user
program runs asynchronously to the DHCP client so that the main event
loop stays active to process other events, including events triggered
by the user program (for example, when it invokes dhcpinfo).

The user program execution is part of the transaction of a DHCP command.
For example, if the user program is not enabled, the transaction of the
DHCP command START is considered over when an ACK is received and the
interface is configured successfully. If the user program is enabled,
it is invoked after the interface is configured successfully, and the
transaction is considered over only when the user program exits. The
event scripting implementation makes use of the asynchronous operations
discussed in the "Transactions" section.

An upper bound of 58 seconds is imposed on how long the user program
can run. If the user program does not exit after 55 seconds, the signal
SIGTERM is sent to it. If it still does not exit after an additional 3
seconds, the signal SIGKILL is sent to it. Since the event handler is
a wrapper around poll(), the DHCP client cannot directly observe the
completion of the user program. Instead, the DHCP client creates a
child "helper" process to synchronously monitor the user program (this
process is also used to send the aforementioned signals to the process,
if necessary).
The DHCP client and the helper process share a pipe
which is included in the set of poll descriptors monitored by the DHCP
client's event handler. When the user program exits, the helper process
passes the user program exit status to the DHCP client through the pipe,
informing the DHCP client that the user program has finished. When the
DHCP client is asked to shut down, it will wait for any running instances
of the user program to complete.
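
The helper-process arrangement can be illustrated with a short C
sketch. It is not the agent's code: spawn_helper() and helper_demo()
are hypothetical names, the 55-second SIGTERM and 3-second SIGKILL
deadlines are left out, and a single poll(2) call stands in for the
agent's event handler. It also assumes that /bin/sh exists.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a helper that runs `argv', blocks in waitpid(2) until it exits,
 * and then writes its exit code (an int) to a pipe.  Returns the read end
 * of the pipe, or -1 on failure; the helper's pid is stored in `*helperp'.
 */
static int
spawn_helper(char *const argv[], pid_t *helperp)
{
	int	fds[2];

	assert(argv != NULL && argv[0] != NULL);
	if (pipe(fds) == -1)
		return (-1);
	if ((*helperp = fork()) == -1) {
		(void) close(fds[0]);
		(void) close(fds[1]);
		return (-1);
	}
	if (*helperp == 0) {
		/* Helper: free to block, since it is not the event loop. */
		int	status, code = -1;
		pid_t	prog;

		(void) close(fds[0]);
		if ((prog = fork()) == 0) {
			(void) execv(argv[0], argv);
			_exit(127);		/* exec failed */
		}
		if (prog > 0 && waitpid(prog, &status, 0) == prog &&
		    WIFEXITED(status))
			code = WEXITSTATUS(status);
		if (write(fds[1], &code, sizeof (code)) != sizeof (code))
			_exit(1);
		_exit(0);
	}
	(void) close(fds[1]);		/* parent keeps only the read end */
	return (fds[0]);
}

/* Run `/bin/sh -c "exit 7"' via a helper; returns the code read, or -1. */
int
helper_demo(void)
{
	char *const	argv[] = { "/bin/sh", "-c", "exit 7", NULL };
	struct pollfd	pfd;
	pid_t		helper;
	int		code = -1;

	if ((pfd.fd = spawn_helper(argv, &helper)) == -1)
		return (-1);
	pfd.events = POLLIN;
	/* The agent would return to its event loop; here we poll one fd. */
	if (poll(&pfd, 1, 10000) != 1 ||
	    read(pfd.fd, &code, sizeof (code)) != sizeof (code))
		code = -1;
	(void) close(pfd.fd);
	(void) waitpid(helper, NULL, 0);	/* reap the helper */
	return (code);
}
```

The point of the extra process is that the blocking waitpid(2) call
happens outside the agent, so the agent itself only ever sees a
readable descriptor, the one kind of event poll(2) can report.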