.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"    This product includes software developed by the University of
.\"    California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 13, 2003
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP \- TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
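.Sh EXAMPLES
The following is a minimal sketch, assuming a POSIX-style sockets
environment, of the passive-socket setup described in
.Sx DESCRIPTION :
a socket is bound to
.Dv INADDR_ANY ,
the
.Dv TCP_NODELAY
option is set at the
.Dv IPPROTO_TCP
level, and the port chosen by the system is read back.
The helper name
.Fn tcp_listen_any ,
its parameters, and the backlog of 5 are hypothetical and chosen only
for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * tcp_listen_any() is a hypothetical helper, not a system interface.
 * It creates a passive TCP socket bound to the wildcard address.
 * A port of 0 asks the system to assign one.  On success the socket
 * descriptor is returned and the bound port (host byte order) is
 * stored in *bound; on failure -1 is returned.
 */
int
tcp_listen_any(unsigned short port, int nodelay, unsigned short *bound)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	/* TCP options are set at the IPPROTO_TCP option level. */
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
	    &nodelay, sizeof(nodelay)) == -1)
		goto fail;

	/* Wildcard addressing: accept connections from any network. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto fail;

	/* listen(2) turns the bound socket into a passive socket. */
	if (listen(s, 5) == -1)
		goto fail;

	/* Read back the local address to learn the assigned port. */
	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		goto fail;
	*bound = ntohs(sin.sin_port);
	return (s);

fail:
	close(s);
	return (-1);
}
```

A caller might invoke
.Fn tcp_listen_any 0 1 &port
and then pass the returned descriptor to
.Xr accept 2 ;
the descriptor should be released with
.Xr close 2
when no longer needed.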