.\" Copyright (c) 1983, 1991, 1993
.\"     The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd July 10, 2004
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
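.Pp
The wildcard binding described above might be sketched as follows.
This is an illustration only, not code taken from the system sources:
the helper name
.Fn open_wildcard_listener
and the backlog of 5 are assumptions chosen for the example.
Passing a port of 0 leaves the port unspecified, so the system assigns one,
which the helper then reads back with
.Xr getsockname 2 .
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to INADDR_ANY.
 * A port of 0 asks the system to assign one.  On success the socket is
 * returned and the bound port (host byte order) is stored in *portp;
 * on failure -1 is returned.
 */
int
open_wildcard_listener(unsigned short port, unsigned short *portp)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int s;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s == -1)
        return (-1);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);  /* wildcard address */
    sin.sin_port = htons(port);               /* 0: system assigns */

    /* bind, then listen to make the socket passive, then query the port */
    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
        listen(s, 5) == -1 ||
        getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
        close(s);
        return (-1);
    }
    *portp = ntohs(sin.sin_port);
    return (s);
}
```
.Ed
.Pp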
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable routers based on
.Fx
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
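.Pp
As a minimal sketch of the option interface described above, the following
sets the boolean
.Dv TCP_NODELAY
option at the
.Dv IPPROTO_TCP
level and reads it back with
.Xr getsockopt 2 .
The helper name
.Fn set_nodelay
is hypothetical and the caller is assumed to supply an existing
.Tn TCP
socket; the same pattern should apply to the other boolean options.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: set the boolean TCP_NODELAY option on socket s
 * and read it back.  Returns 1 if the option reads back as enabled,
 * 0 if it reads back as disabled, and -1 if either call fails.
 */
int
set_nodelay(int s, int enable)
{
    int val = enable ? 1 : 0;
    socklen_t len = sizeof(val);

    /* The option level is the TCP protocol number, IPPROTO_TCP. */
    if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) == -1)
        return (-1);
    if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
        return (-1);
    return (val != 0);
}
```
.Ed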
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP - TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .