.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 13, 2003
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_MD5SIG"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a FreeBSD router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4 (AF_INET) sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
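.Pp
As a minimal sketch of the calls described above (an illustration, not part
of the original manual text), the following C fragment creates a passive
socket bound to
.Dv INADDR_ANY
and then sets and reads back the boolean
.Dv TCP_NODELAY
option at the
.Dv IPPROTO_TCP
level.
The helper names
.Fn tcp_listen_any ,
.Fn tcp_set_nodelay
and
.Fn tcp_get_nodelay
are hypothetical, as is the listen backlog of 5.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to the
 * wildcard address INADDR_ANY.  A port of 0 asks the system to
 * assign one.  Returns the descriptor, or -1 on error.
 */
int
tcp_listen_any(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);

	/* bind(2) first, then listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1) {	/* backlog of 5 is an assumption */
		close(s);
		return (-1);
	}
	return (s);
}

/* Hypothetical helper: set or clear TCP_NODELAY; returns 0 on success. */
int
tcp_set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Hypothetical helper: returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int
tcp_get_nodelay(int s)
{
	int on = 0;
	socklen_t len = sizeof(on);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, &len) == -1)
		return (-1);
	return (on != 0);
}
```
.Ed
.Pp
Passing a port of 0 leaves the choice of port to the system; the assigned
port can then be retrieved with
.Xr getsockname 2 .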
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP \- TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .