.\" Copyright (c) 1983, 1991, 1993
.\"     The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd November 2, 2004
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
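.Pp
As a minimal sketch of the socket option interface described above, the
following C fragment sets and reads back the boolean
.Dv TCP_NODELAY
option, using
.Dv IPPROTO_TCP
as the option level.
The helper names
.Fn set_tcp_nodelay
and
.Fn get_tcp_nodelay
are hypothetical and are not part of any system library;
how to react to a failure is left to the caller.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper (not a system API): enable or disable TCP_NODELAY
 * on a TCP socket.  Returns 0 on success, or -1 with errno set by
 * setsockopt(2).
 */
int
set_tcp_nodelay(int fd, int on)
{
	int val = on ? 1 : 0;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)));
}

/*
 * Hypothetical helper (not a system API): read TCP_NODELAY back.
 * Returns 1 if the option is set, 0 if it is clear, or -1 with errno
 * set by getsockopt(2).
 */
int
get_tcp_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	/* Some systems report a non-zero flag value rather than exactly 1. */
	return (val != 0);
}
```

A caller would typically create the socket with
.Fn socket AF_INET SOCK_STREAM 0
and invoke the setter before the first write.
The same pattern should apply to the other boolean options listed above,
with the option name substituted.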
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
.It Va inflight.min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight.max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight.stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight.min
and, if that does not
work, reduce both
.Va inflight.min
and
.Va inflight.stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight.stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.initburst
Control the number of SACK retransmissions done upon initiation of SACK
recovery.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .