.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd November 14, 2011
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate between themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable such
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening on the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
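.Sh EXAMPLES
The following is a minimal sketch, under stated assumptions, of two
operations described in this page: binding a passive socket to
.Dv INADDR_ANY
with a system-assigned port, and setting and reading back
.Dv TCP_NODELAY
at the
.Dv IPPROTO_TCP
option level.
The helper names
.Fn tcp_listen_any
and
.Fn tcp_set_nodelay
are hypothetical and are not part of any system interface, and the
.Xr listen 2
backlog of 5 is an arbitrary choice.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to the
 * wildcard address INADDR_ANY, letting the system assign the port.
 * On success the descriptor is returned and the assigned port is
 * stored in *port in host byte order; on failure -1 is returned.
 */
int
tcp_listen_any(unsigned short *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(0);	/* 0: let the system pick a port */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||	/* backlog of 5 is an assumption */
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}

/*
 * Hypothetical helper: set the boolean TCP_NODELAY option on socket s
 * and read it back.  Returns 1 if the option reads back as enabled,
 * 0 if disabled, or -1 on error.  The value read back is only tested
 * for being non-zero, since its exact value is not assumed here.
 */
int
tcp_set_nodelay(int s, int on)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```
.Ed
.Pp
A caller might obtain a listening descriptor with
.Fn tcp_listen_any ,
pass it to
.Xr accept 2 ,
and then call
.Fn tcp_set_nodelay
on the accepted descriptor before exchanging small interactive messages.
Both helpers report failure with \-1 and leave
.Va errno
as set by the failing system call.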