.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd November 8, 2013
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
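.Pp
As a minimal sketch of the distinction above, the following C fragment
creates a passive socket bound to
.Dv INADDR_ANY
and an active socket that connects to it over the loopback address.
The function names
.Fn passive_open ,
.Fn local_port
and
.Fn active_open
are hypothetical helpers introduced for illustration only, and the
.Xr listen 2
backlog of 5 is an arbitrary choice.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket bound to the wildcard
 * address.  A port of 0 lets the system assign one.
 * Returns the descriptor, or -1 on error.
 */
int
passive_open(in_port_t port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	/* bind(2) first, then listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Hypothetical helper: return the local port of a bound socket in
 * host byte order, or 0 on error.
 */
in_port_t
local_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (0);
	return (ntohs(sin.sin_port));
}

/*
 * Hypothetical helper: create an active socket and connect it to the
 * given port on the loopback address.
 * Returns the descriptor, or -1 on error.
 */
int
active_open(in_port_t port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```
.Ed
.Pp
A server would typically pass the descriptor returned by
.Fn passive_open
to
.Xr accept 2
in a loop.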
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
Additions specific to
.Fx
include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
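.Sh EXAMPLES
The following is a minimal sketch, not a definitive implementation, of
setting and querying some of the socket options listed in
.Sx DESCRIPTION
at the
.Dv IPPROTO_TCP
level.
The helper names
.Fn set_nodelay ,
.Fn set_keepalive
and
.Fn get_tcp_int
are hypothetical and exist only for illustration.
The code assumes that the options take an integer-sized argument and uses
.Vt "unsigned int"
where the text above says
.Vt "u_int" .
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the boolean TCP_NODELAY option.
 * Returns 0 on success, -1 on error.
 */
int
set_nodelay(int s)
{
	int on = 1;

	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Hypothetical helper: enable keepalive probes on the socket and set
 * the per-socket idle time (seconds), probe interval (seconds) and
 * probe count.  Returns 0 on success, -1 on error.
 */
int
set_keepalive(int s, unsigned int idle, unsigned int intvl,
    unsigned int cnt)
{
	int on = 1;

	/* SO_KEEPALIVE is a socket-level option, not a TCP-level one. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
	    sizeof(cnt)) == -1)
		return (-1);
	return (0);
}

/*
 * Hypothetical helper: read back an integer-valued TCP option.
 * Returns the value, or -1 on error.
 */
int
get_tcp_int(int s, int opt)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, opt, &val, &len) == -1)
		return (-1);
	return (val);
}
```
.Ed
.Pp
Because the keepalive values are inherited across
.Xr accept 2 ,
a server may call
.Fn set_keepalive
once on its listening socket rather than on every accepted connection.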