CDDL HEADER START

The contents of this file are subject to the terms of the
Common Development and Distribution License (the "License").
You may not use this file except in compliance with the License.

You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
or http://www.opensolaris.org/os/licensing.
See the License for the specific language governing permissions
and limitations under the License.

When distributing Covered Code, include this CDDL HEADER in each
file and include the License file at usr/src/OPENSOLARIS.LICENSE.
If applicable, add the following below this CDDL HEADER, with the
fields enclosed by brackets "[]" replaced with your own identifying
information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright 2010 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Implementation Overview for the NetWork AutoMagic daemon
John Beck, Renee Danson, Michael Hunter, Alan Maguire, Kacheong Poon,
Garima Tripathi, Jan Xie, Anurag Maskey
[Structure and some content shamelessly stolen from Peter Memishian's
dhcpagent architecture overview.]

INTRODUCTION
============

Details about the NWAM requirements, architecture, and design are
available via the NWAM opensolaris project at
http://opensolaris.org/os/project/nwam. The point of this document is
to place details relevant to somebody attempting to understand the
implementation close to the source code.

THE BASICS
==========

SOURCE FILE ORGANIZATION
========================
event sources:
        dlpi_events.c
        routing_events.c
        sysevent_events.c

object-specific event handlers:
        enm.c
        known_wlans.c
        loc.c
        ncp.c
        ncu_ip.c
        ncu_phys.c

legacy config upgrade:
        llp.c

generic code:
        objects.c
        events.c
        conditions.c
        logging.c
        util.c

nwam door requests:
        door_if.c

entry point:
        main.c

OVERVIEW
========

Here we discuss the essential objects and subtle aspects of the NWAM
daemon implementation. Note that there is of course much more that is
not discussed here, but after this overview you should be able to fend
for yourself in the source code.

Events and Objects
==================

Events come to NWAM asynchronously from a variety of different sources:

o routing socket
o dlpi
o sysevents
o doors

Routing sockets and dlpi (DL_NOTE_LINK_UP|DOWN events) are handled by
dedicated threads. Sysevents and doors are both seen as callbacks into
the process proper and will often post their results to the main event
queue. All event sources post events onto the main event queue. In
addition, state changes of objects and door requests (requesting current
state or a change of state, specification of a WiFi key, etc.) can
lead to additional events. We have daemon-internal events (object
initialization, periodic state checks) which are simply enqueued
on the event queue, and external events which are both enqueued on
the event queue and sent to registered listeners (via nwam_event_send()).

So the structure of the daemon is a set of threads that drive event
generation. Events are posted either directly onto the event queue
or are delayed by posting onto the pending event queue. A SIGALRM
is set for the event delay, and when the SIGALRM is received,
pending events that have expired are moved onto the event queue
proper.
Delayed enqueueing is useful for periodic checks.

Decisions to change conditions based upon object state changes are
delayed until after bursts of events. This is achieved by setting a
flag when a check is deemed necessary and then performing those checks
the next time the queue is empty. A typical event profile will
be one event (e.g. a link down) causing a flurry of other events (e.g.
related interface down). By waiting until all the consequences of the
initial event have been carried out before making higher level
decisions, we implicitly debounce those higher level decisions.

At the moment, queue quiet actually means that the queue has been quiet
for some short period of time (0.1s). Typically the flurry of events we
want to work through are internally generated and are back to back in
the queue. We wait a bit longer in case there are repercussions from
what we do that cause external events to be posted to us. We are not
interested in waiting for longer term things to happen but merely in
catching immediate changes.

When running, the daemon will consist of a number of threads:

o the event handling thread: a thread blocking until events appear on the
  event queue, processing each event in order. Events that require
  time-consuming processing are handed off to worker threads (e.g. WiFi
  connect, DHCP requests, etc.).
o door request threads: the door infrastructure manages server threads
  which process synchronous NWAM client requests (e.g. get state of an
  object, connect to a specific WLAN, initiate a scan on a link, etc.).
o various wifi/IP threads: threads which do asynchronous work such as
  DHCP requests, WLAN scans, etc. that cannot hold up event processing in
  the main event handling thread.
o routing socket threads: process routing socket messages of interest
  (address additions/deletions) and package them as NWAM messages.
o dlpi threads: used to monitor for DL_NOTE_LINK messages on links

The daemon is structured around a set of objects representing NCPs[1],
NCUs[2], ENMs[3] and known WLANs and a set of state machines which
consume events which act on those objects. Object lists are maintained
for each object type, and these contain both a libnwam handle (to allow
reading the object directly) and an optional object data pointer which
can point to state information used to configure the object.

Events can be associated with specific objects (e.g. link up), or associated
with no object in particular (e.g. shutdown).

Each object type registers a set of event handler functions with the event
framework such that when an event occurs, the appropriate handler for the
object type is used. The event handlers are usually called
nwamd_handle_*_event().

[1] NCP Network Configuration Profile; the set of link- and IP-layer
configuration units which collectively specify how a system should be
connected to the network

[2] NCU Network Configuration Unit; the individual components of an NCP

[3] ENM External Network Modifiers; user executable scripts often used
to configure a VPN

Doors and External Events
=========================

The command interface to nwamd is through a door at NWAM_DOOR
(/etc/svc/volatile/nwam/nwam_door). This door allows external programs to
send messages to nwamd. The way doors work is to provide a mechanism for
another process to execute code in your process space. This looks like
a CSPish send/receive/reply in that the receiving process provides a
synchronization point (via door_create(3C)), the calling process uses
that synchronization point to rendezvous with and provide arguments (via
door_call(3C)), and then the receiving process replies (via
door_return(3C)), passing back data as required.
The OS arranges for the memory used to pass data via door_call(3C) to
be mapped into the receiving process, which can write back into it and
then have it transparently mapped back to the calling process.

As well as handling internal events of interest, the daemon also needs
to send events of interest (link up/down, WLAN scan/connect results, etc.)
to (possibly) multiple NWAM client listeners. This is done via
System V message queues. On registering for events via a libnwam door
request into the daemon (nwam_events_register()), a per-client
(identified by pid) message queue file is created. The
daemon sends messages to all listeners by examining the list of
message queue files (allowing registration to be robust across
daemon restarts) and sending events to each listener. This is done
via the libnwam function nwam_event_send() which hides the IPC
mechanism from the daemon.

Objects
=======
Four object lists are maintained within the daemon - one for each of
the configuration object types libnwam manages, i.e.:

o ENMs
o locations
o known WLANs
o NCUs of the current active NCP

Objects have an associated libnwam handle and an optional data
field (which is used for NCUs only).

Locking is straightforward - nwamd_object_init() will initialize
an object of a particular type in the appropriate object list,
returning it with the object lock held. When the object is no longer
needed, nwamd_object_unlock() should be called on it.

To retrieve an existing object, nwamd_object_find() should be
called - again this returns the object in a locked state.

nwamd_object_lock() is deliberately not exposed outside of objects.c,
since object locking is implicit in the above creation/retrieval
functions.
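
The "returned locked, caller unlocks" convention described above can be
illustrated with a minimal sketch. This is not the code in objects.c: the
sketch_* names, the fixed-size table and the struct layout are all
hypothetical, and removal of objects is left out so that a pointer
returned by the find routine can never go stale.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define	SKETCH_MAX_OBJECTS	8

/* Hypothetical stand-in for a daemon object; not nwamd's real struct. */
typedef struct sketch_object {
	char		name[32];
	void		*data;		/* optional per-object data */
	int		in_use;
	pthread_mutex_t	lock;		/* held while a caller uses the object */
} sketch_object_t;

static sketch_object_t	sketch_list[SKETCH_MAX_OBJECTS];
static pthread_mutex_t	sketch_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Create an object in the list and return it with its lock held. */
sketch_object_t *
sketch_object_init(const char *name, void *data)
{
	sketch_object_t *obj = NULL;

	assert(name != NULL);
	pthread_mutex_lock(&sketch_list_lock);
	for (size_t i = 0; i < SKETCH_MAX_OBJECTS; i++) {
		if (sketch_list[i].in_use)
			continue;
		obj = &sketch_list[i];
		strncpy(obj->name, name, sizeof (obj->name) - 1);
		obj->name[sizeof (obj->name) - 1] = '\0';
		obj->data = data;
		pthread_mutex_init(&obj->lock, NULL);
		pthread_mutex_lock(&obj->lock);
		obj->in_use = 1;
		break;
	}
	pthread_mutex_unlock(&sketch_list_lock);
	return (obj);		/* NULL if the table is full */
}

/* Look an object up by name and return it locked, or NULL if absent. */
sketch_object_t *
sketch_object_find(const char *name)
{
	sketch_object_t *obj = NULL;

	assert(name != NULL);
	pthread_mutex_lock(&sketch_list_lock);
	for (size_t i = 0; i < SKETCH_MAX_OBJECTS; i++) {
		if (sketch_list[i].in_use &&
		    strcmp(sketch_list[i].name, name) == 0) {
			obj = &sketch_list[i];
			break;
		}
	}
	pthread_mutex_unlock(&sketch_list_lock);
	/* Safe only because this sketch never removes objects. */
	if (obj != NULL)
		pthread_mutex_lock(&obj->lock);
	return (obj);
}

/* Release an object obtained from sketch_object_init() or _find(). */
void
sketch_object_unlock(sketch_object_t *obj)
{
	pthread_mutex_unlock(&obj->lock);
}
```

A caller pairs every successful init or find with exactly one unlock.
Because there is no separate lock routine in the public interface, a
caller cannot hold an object without having gone through the list lookup.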

An object is removed from the object list (with handle destroyed)
via nwamd_object_fini() - the object data (if any) is returned
from this call to allow deallocation.

Object state
============
nwamd deals with three broad types of object that need to maintain
internal state: NCUs, ENMs and locations (known WLANs are configuration
objects but don't have a state beyond simply being present).
NWAM objects all share a basic set of states:

State           Description
=====           ===========
uninitialized   object representation not present on system or in nwamd
initialized     object representation present in system and in nwamd
disabled        disabled manually
offline         external conditions are not satisfied
offline*        external conditions are satisfied, trying to move online
online*         external conditions no longer satisfied, trying to move offline
online          conditions satisfied and configured
maintenance     error occurred in applying configuration

These deliberately mimic SMF states.

The states of interest are offline, offline* and online.

An object (link/interface NCU, ENM or location) should only move online
when its conditions are satisfied _and_ its configuration has been successfully
applied. This occurs when an ENM method has run, or a link is up, or an
interface has at least one address assigned.

To understand the distinction between offline and offline*, consider the case
where a link is of prioritized activation, and is either in a lower priority
group - and hence inactive (due to cable being unplugged or inability to
connect to wifi) - or in a higher priority group - and hence active. In
general, we want to distinguish between two cases:

1) when we are actively configuring the link with a view to moving online
(offline*), as would be the case when the link's priority group is
active.
2) when external policy-based conditions prevent a link from being active.
offline should be used for such cases. Links in priority groups above and
below the currently-active group will be offline, since policy precludes them
from activating (as less-prioritized links).

So we see that offline and offline* can be used to distinguish between
cases that have the potential to move online (offline*) from a policy
perspective - i.e. conditions on the location allow it, or link prioritization
allows it - and cases where external conditions dictate that it should not
(offline).

Once an object reaches offline*, its configuration processes should kick in.
This is where auxiliary state is useful, as it allows us to distinguish between
various states in that configuration process. For example, a link can be
waiting for WLAN selection or key data, or an interface can be waiting for
a DHCP response. This auxiliary state can then also be used diagnostically by
libnwam consumers to determine the current status of a link, interface, ENM,
etc.

WiFi links present a problem however. On the one hand, we want them
to be inactive when they are not part of the current priority grouping,
while on the other we want to watch out for new WLANs appearing in
scan data if the WiFi link is of a higher priority than the currently-selected
group. The reason we watch out for these is that they represent the potential
to change priority grouping to a more preferred group. To accommodate this,
WiFi links of the same or lower (more preferred) priority group will always
be trying to connect (and thus be offline* if they cannot).

It might appear unnecessary to have a separate state value/machine for
auxiliary state - why can't we simply add the auxiliary state machine to the
global object state machine?
Part of the answer is that there are times we
need to run through the same configuration state machine when the global
object state is different - in particular either offline* or online. Consider
WiFi - we want to do periodic scans to find a "better" WLAN - we can easily
do this by running back through the link state machine of auxiliary
states, but we want to stay online while we do it, since we are still
connected (if the WLAN disconnects of course we go to LINK_DOWN and offline).

Another reason we wish to separate the more general states (offline, online,
etc.) from the more specific ones (WIFI_NEED_SELECTION etc.) is to ensure
that the representation of configuration objects closely matches the way
SMF works.

For an NCU physical link, the following link-specific auxiliary states are
used:

Auxiliary state                 Description
===============                 ===========

LINK_WIFI_SCANNING              Scan in progress
LINK_WIFI_NEED_SELECTION        Need user to specify WLAN
LINK_WIFI_NEED_KEY              Need user to specify a WLAN key for selection
LINK_WIFI_CONNECTING            Connecting to current selection

A WiFi link differs from a wired one in that it always has the
potential to be available - it just depends on whether visited WLANs are
in range. So such links - if they are higher in the priority grouping than
the currently-active priority group - should always be able to scan, as they
are always "trying" to be activated.

Wired links that do not support DL_NOTE_LINK_UP/DOWN are problematic,
since we have to simply assume a cable is plugged in. If an IP NCU
is activated above such a link, and that NCU uses DHCP, a timeout
will be triggered eventually (user-configurable via the nwamd/ncu_wait_time
SMF property of the network/physical:nwam instance) which will cause
us to give up on the link.

For an IP interface NCU, the following auxiliary states are suggested.

Auxiliary state                         Description
===============                         ===========

NWAM_AUX_STATE_IF_WAITING_FOR_ADDR      Waiting for an address to be assigned
NWAM_AUX_STATE_IF_DHCP_TIMED_OUT        DHCP timed out on interface

A link can have multiple logical interfaces plumbed on it consisting
of a mix of static and DHCP-acquired addresses. This means that
we need to decide how to aggregate the state of these logical
interfaces into the NCU state. The concept of "up" we use here
does not correspond to IFF_UP or IFF_RUNNING, but rather
to having at least one address assigned to the link (as reported
by RTM_NEWADDR events with non-zero addresses).

We use this concept of up as it represents the potential for
network communication - e.g. after assigning a static
address, if the location specifies nameserver etc., it
is possible to communicate over the network. One important
edge case here is that when DHCP information comes
in, we need to reassess location activation conditions and
possibly change or reapply the current location. The problem
is that if we have a static/DHCP mix, and if we rely on
the IP interface's notion of "up" to trigger location activation,
we will likely first apply the location when the static address
has been assigned and before the DHCP information has
been returned (which may include nameserver info). So
the solution is that on getting an RTM_NEWADDR, we
check if the associated (logical) interface is DHCP, and
even if the interface NCU is already up, we reassess
location activation. This will lead to a reapplication of
the current location or possibly a location switch.
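
The RTM_NEWADDR handling described above can be summarised in a small
sketch. The types and the sketch_handle_newaddr() function are
hypothetical and are not taken from ncu_ip.c. The sketch also assumes that
an interface NCU first coming up triggers a location reassessment, which
the text implies but does not state outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-interface state; not nwamd's real NCU data. */
typedef struct sketch_if_ncu {
	bool	up;	/* at least one non-zero address has been seen */
} sketch_if_ncu_t;

/* What the caller should do after one RTM_NEWADDR message. */
typedef struct sketch_newaddr_result {
	bool	became_up;		/* NCU just moved to "up" */
	bool	reassess_location;	/* location conditions need a recheck */
} sketch_newaddr_result_t;

/*
 * Process one RTM_NEWADDR for the given interface NCU.
 * addr_nonzero: the message carried a non-zero address.
 * addr_is_dhcp: the logical interface the address belongs to uses DHCP.
 */
sketch_newaddr_result_t
sketch_handle_newaddr(sketch_if_ncu_t *ncu, bool addr_nonzero,
    bool addr_is_dhcp)
{
	sketch_newaddr_result_t res = { false, false };

	assert(ncu != NULL);

	/* Zero addresses do not count towards "up". */
	if (!addr_nonzero)
		return (res);

	if (!ncu->up) {
		ncu->up = true;
		res.became_up = true;
	}

	/*
	 * A DHCP address may bring nameserver information with it, so
	 * reassess the location even when the NCU was already up.
	 */
	res.reassess_location = res.became_up || addr_is_dhcp;
	return (res);
}
```

With a static/DHCP mix, the static address brings the NCU up and triggers
the first reassessment. The later DHCP address leaves the NCU state alone
but still requests a second reassessment.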

In order to move through the various states, a generic
API is supplied:

nwam_error_t
nwamd_object_set_state(nwamd_object_t obj, nwamd_state_t state,
    nwamd_aux_state_t aux_state);

This function creates an OBJECT_STATE event containing
the new state/aux_state and enqueues it in the event
queue. Each object registers its own handler for this
event, and in response to the current state/aux state and
desired aux state it responds appropriately in the event
handling thread, spawning other threads to carry out
actions as appropriate. The object state event is
then sent to any registered listeners.

So for NCUs, we define a handle_object_state() function
to run the state machine for the NCU object.

Link state and NCP policy
=========================

NCPs can be:

o prioritized: where the constituent link NCUs specify priority group
  numbers (where lower are more favoured) and grouping types. These
  are used to allow link NCUs to be either grouped separately (exclusive)
  or together (shared or all).
o manual: their activation is governed by the value of their enabled
  property.
o a combination of the above.

IP interface NCUs inherit their activation from the links below them,
so an IP interface NCU will be active if its underlying link is (assuming
it hasn't been disabled).

At startup, and at regular intervals (often triggered by NWAM
events), the NCP policy needs to be reassessed. There
are a number of causes for NCP policy to be reassessed -

o a periodic check of link state that occurs every N seconds
o a link goes from offline(*) to online (cable plug/wifi connect)
o a link goes from online to offline (cable unplug/wifi disconnect).

Any of these should cause the link selection algorithm to rerun.
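
As a minimal sketch of the two state-driven triggers listed above, the
test below decides whether a link state transition should cause link
selection to rerun. The enum and function names are hypothetical; the
states are simply those from the state table in this document. The
periodic check is timer-driven and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the object states from the state table. */
typedef enum sketch_state {
	SK_UNINITIALIZED,
	SK_INITIALIZED,
	SK_DISABLED,
	SK_OFFLINE,
	SK_OFFLINE_STAR,	/* offline*: trying to move online */
	SK_ONLINE_STAR,		/* online*: trying to move offline */
	SK_ONLINE,
	SK_MAINTENANCE
} sketch_state_t;

/*
 * Return true if a link moving from prev to next is one of the
 * transitions that should rerun link selection.
 */
bool
sketch_ncp_recheck_needed(sketch_state_t prev, sketch_state_t next)
{
	assert(prev <= SK_MAINTENANCE && next <= SK_MAINTENANCE);

	/* offline(*) -> online: cable plugged in or WiFi connected. */
	if ((prev == SK_OFFLINE || prev == SK_OFFLINE_STAR) &&
	    next == SK_ONLINE)
		return (true);

	/* online -> offline: cable unplugged or WiFi disconnected. */
	if (prev == SK_ONLINE && next == SK_OFFLINE)
		return (true);

	return (false);
}
```

Transitions that do not match either trigger, such as offline to
offline*, return false here and leave any recheck to the periodic timer.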

The link selection algorithm works as follows:

Starting from the lowest priority grouping value, assess all links
in that priority group.

The current priority-group is considered failed if:

o "exclusive" NCUs exist and none are offline*/online,
o "shared" NCUs exist and none are offline*/online,
o "all" NCUs exist and all are not offline*/online,
o no NCUs are offline*/online.

We do not invalidate a link that is offline* since its configuration
is in progress. This has the unfortunate side-effect that
wired links that do not do DL_NOTE_LINK_UP/DOWN will never
fail. To have such links skipped, their priority group value
should be increased (prioritizing wireless links).

Once a priority group has been selected, all links in groups above
_and_ below it need to be moved offline.

Location Activation
===================
A basic set of system-supplied locations is provided - NoNet and
Automatic. nwamd will apply the NoNet location until such a time
as an interface NCU is online, at which point it will switch
to the Automatic location. If a user-supplied location exists,
and it is either manually enabled or its conditions are satisfied, it
will be preferred and activated instead. Only one location can be
active at once since each location has its own specification of nameservices
etc.

ENM Activation
==============
ENMs are either manual or conditional in activation and will be
activated if they are enabled (manual) or if the conditions
are met (conditional). Multiple ENMs can be active at once.
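
Returning to the link selection algorithm under "Link state and NCP
policy", the priority-group failure test can be sketched as follows.
This is an illustration rather than the code in ncp.c: all sketch_* names
are hypothetical, and the rule for "all" NCUs is read here as "at least
one of them is neither offline* nor online", which is an assumption about
the intended meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum sketch_mode {
	SK_MODE_EXCLUSIVE,
	SK_MODE_SHARED,
	SK_MODE_ALL,
	SK_MODE_COUNT
} sketch_mode_t;

/* Hypothetical view of a link NCU, reduced to what selection needs. */
typedef struct sketch_link {
	int		group;	/* priority group; lower is preferred */
	sketch_mode_t	mode;	/* grouping type */
	bool		live;	/* link is offline* or online */
} sketch_link_t;

/* Apply the four failure rules to one priority group. */
bool
sketch_group_failed(const sketch_link_t *links, size_t n, int group)
{
	size_t members[SK_MODE_COUNT] = { 0 };
	size_t live[SK_MODE_COUNT] = { 0 };
	size_t total_live = 0;

	assert(links != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (links[i].group != group)
			continue;
		members[links[i].mode]++;
		if (links[i].live) {
			live[links[i].mode]++;
			total_live++;
		}
	}
	if (total_live == 0)
		return (true);
	if (members[SK_MODE_EXCLUSIVE] > 0 && live[SK_MODE_EXCLUSIVE] == 0)
		return (true);
	if (members[SK_MODE_SHARED] > 0 && live[SK_MODE_SHARED] == 0)
		return (true);
	/* Assumption: "all are not" means "at least one is not". */
	if (live[SK_MODE_ALL] < members[SK_MODE_ALL])
		return (true);
	return (false);
}

/* Walk groups in ascending order; return the first that has not failed. */
int
sketch_select_group(const sketch_link_t *links, size_t n)
{
	bool have_last = false;
	int last = 0;

	for (;;) {
		bool found = false;
		int next = 0;

		/* Find the smallest group value not yet tried. */
		for (size_t i = 0; i < n; i++) {
			int g = links[i].group;

			if (have_last && g <= last)
				continue;
			if (!found || g < next) {
				next = g;
				found = true;
			}
		}
		if (!found)
			return (-1);	/* every group has failed */
		if (!sketch_group_failed(links, n, next))
			return (next);
		last = next;
		have_last = true;
	}
}
```

Once sketch_select_group() returns a group, the caller would move links
in all other groups offline, as described above. A return of -1 means
no group currently qualifies.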