.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd February 5, 2012
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This write-only
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see the
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
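.Pp
As an illustration of how the integer-valued options in the entries above
are passed to
.Xr setsockopt 2 ,
the following minimal sketch sets
.Dv TCP_NODELAY
and the per-socket keepalive timers.
The helper names
.Fn tcp_set_uint ,
.Fn tcp_tune_keepalive
and
.Fn tcp_get_nodelay
are hypothetical and are not part of any system library; the sketch also
assumes that
.Dv SO_KEEPALIVE
should be enabled together with the timers.
.Bd -literal
```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/* Hypothetical helper: set one integer-valued option at the TCP level. */
int
tcp_set_uint(int s, int optname, unsigned int value)
{
	assert(s >= 0);		/* caller must pass an open descriptor */
	return (setsockopt(s, IPPROTO_TCP, optname, &value, sizeof(value)));
}

/*
 * Hypothetical helper: enable keepalive probes on a socket and tune the
 * per-socket timers.  idle and intvl are in seconds; cnt is a number of
 * unanswered probes.  Returns 0 on success and -1 on failure.
 */
int
tcp_tune_keepalive(int s, unsigned int idle, unsigned int intvl,
    unsigned int cnt)
{
	int on = 1;

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (tcp_set_uint(s, TCP_KEEPIDLE, idle) == -1 ||
	    tcp_set_uint(s, TCP_KEEPINTVL, intvl) == -1 ||
	    tcp_set_uint(s, TCP_KEEPCNT, cnt) == -1)
		return (-1);
	return (0);
}

/* Hypothetical helper: returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int
tcp_get_nodelay(int s)
{
	int value = 0;
	socklen_t len = sizeof(value);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &value, &len) == -1)
		return (-1);
	return (value != 0);
}
```
.Ed
.Pp
A caller might create a socket with
.Xr socket 2 ,
call
.Fn tcp_set_uint s TCP_NODELAY 1 ,
and confirm the result with
.Fn tcp_get_nodelay s .
Because the keepalive options are described above as write-only, the
sketch does not read them back.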
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate between themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
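.Sh EXAMPLES
The following minimal sketch illustrates the passive and active socket
roles and the wildcard addressing described in
.Sx DESCRIPTION .
It binds a passive socket to
.Dv INADDR_ANY ,
lets the system assign a port when the requested port is 0,
and connects an active socket to it over the loopback address.
The helper names
.Fn tcp_listen_any
and
.Fn tcp_connect_loopback ,
the backlog of 5, and the use of the loopback address are assumptions made
for illustration only.
.Bd -literal
```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket bound to the wildcard
 * address.  A port of 0 lets the system assign one; the port actually
 * in use is stored in *portp in host byte order.
 */
int
tcp_listen_any(unsigned short port, unsigned short *portp)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	assert(portp != NULL);
	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*portp = ntohs(sin.sin_port);
	return (s);
}

/* Hypothetical helper: create an active socket connected to the loopback address. */
int
tcp_connect_loopback(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```
.Ed
.Pp
After
.Fn tcp_connect_loopback
succeeds, calling
.Xr accept 2
on the passive socket returns a connected socket whose local address, as
reported by
.Xr getsockname 2 ,
is the loopback address and no longer the wildcard.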