.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd February 15, 2011
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr cc 4
for details.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate between themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
network level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 causes any
.Tn TCP
packet sent to a closed port to be logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr cc 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
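.Sh EXAMPLES
The socket options described in this page are set and queried at the
.Dv IPPROTO_TCP
level.
The following is a minimal sketch, not a definitive implementation:
it toggles
.Dv TCP_NODELAY
with
.Xr setsockopt 2
and reads the current maximum segment size through
.Dv TCP_MAXSEG
with
.Xr getsockopt 2 .
The helper names
.Fn tcp_set_nodelay
and
.Fn tcp_get_maxseg
are hypothetical and are not part of any system library.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: enable (on != 0) or disable (on == 0)
 * TCP_NODELAY on an existing TCP socket.
 * Returns 0 on success, or -1 with errno set, as setsockopt(2) does.
 */
int
tcp_set_nodelay(int fd, int on)
{
	int flag = (on != 0);

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag,
	    sizeof(flag)));
}

/*
 * Hypothetical helper: store the maximum segment size currently
 * reported for the socket in *mss.
 * Returns 0 on success, or -1 with errno set, as getsockopt(2) does.
 */
int
tcp_get_maxseg(int fd, int *mss)
{
	socklen_t len = sizeof(*mss);

	assert(mss != NULL);
	return (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, mss, &len));
}
```
.Ed
.Pp
A caller would typically invoke
.Fn tcp_set_nodelay
on a descriptor returned by
.Xr socket 2
or
.Xr accept 2
before the first
.Xr write 2 ,
and check the return value in the same way as for a direct
.Xr setsockopt 2
call.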