.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd August 25, 2005
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel
periodically sends a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
.It Va inflight.min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight.max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight.stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight.min
and, if that does not
work, reduce both
.Va inflight.min
and
.Va inflight.stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight.stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.initburst
Control the number of SACK retransmissions done upon initiation of SACK
recovery.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
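.Sh EXAMPLES
The socket options listed in
.Sx DESCRIPTION
are set with
.Xr setsockopt 2
and read back with
.Xr getsockopt 2
at level
.Dv IPPROTO_TCP .
The following is a minimal sketch of that pattern for the boolean option
.Dv TCP_NODELAY ;
the helper names
.Fn set_tcp_nodelay
and
.Fn get_tcp_nodelay
are hypothetical and are not part of any system interface.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper (illustration only): turn TCP_NODELAY on or off
 * for a TCP socket.  Returns 0 on success, or -1 with errno set, exactly
 * as setsockopt(2) does.
 */
int
set_tcp_nodelay(int fd, int on)
{
	int flag = on ? 1 : 0;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)));
}

/*
 * Hypothetical helper (illustration only): read the option back with
 * getsockopt(2).  Returns 1 if the option is set, 0 if it is clear,
 * and -1 on error.
 */
int
get_tcp_nodelay(int fd)
{
	int flag = 0;
	socklen_t len = sizeof(flag);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) == -1)
		return (-1);
	return (flag != 0);
}
```
.Ed
.Pp
A caller would typically create the socket with
.Fn socket AF_INET SOCK_STREAM 0 ,
call
.Fn set_tcp_nodelay
before the first write, and check the return value.
Other integer-valued options, such as
.Dv TCP_NOPUSH ,
are passed to
.Xr setsockopt 2
in the same way.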