.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd October 13, 2014
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
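.Pp
As an illustration of the option level and option names described above,
the following C sketch sets
.Dv TCP_NODELAY
and the per-socket keepalive options on a socket, and reads an option back
with
.Xr getsockopt 2 .
The helper names
.Fn tcp_set_nodelay ,
.Fn tcp_set_keepalive
and
.Fn tcp_get_uint_opt
are hypothetical and are not part of any system interface.
The sketch assumes that keepalive probes are enabled per socket with
.Dv SO_KEEPALIVE
at the
.Dv SOL_SOCKET
level, and the timing values a caller passes are arbitrary examples, not
recommendations.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable (on != 0) or disable (on == 0) TCP_NODELAY.
 * Returns 0 on success, -1 on error with errno set by setsockopt(2).
 */
int
tcp_set_nodelay(int fd, int on)
{
	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Hypothetical helper: turn on keepalive probes for this socket and set
 * the idle time and probe interval (both in seconds) and the probe count.
 * Returns 0 on success, -1 on the first failing setsockopt(2) call.
 */
int
tcp_set_keepalive(int fd, unsigned int idle, unsigned int intvl,
    unsigned int cnt)
{
	int on = 1;

	/* Assumption: probes are only sent when SO_KEEPALIVE is enabled. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
	    sizeof(cnt)) == -1)
		return (-1);
	return (0);
}

/*
 * Hypothetical helper: read back an integer-valued TCP-level option.
 * Returns 0 on success and stores the current value in *val.
 */
int
tcp_get_uint_opt(int fd, int opt, unsigned int *val)
{
	socklen_t len = sizeof(*val);

	return (getsockopt(fd, IPPROTO_TCP, opt, val, &len));
}
```

A caller would typically create the socket with
.Fn socket AF_INET SOCK_STREAM 0 ,
apply these helpers before
.Xr connect 2
or
.Xr listen 2 ,
and check each return value, since support for individual options can vary
between systems.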
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Turn on automatic path MTU blackhole detection.
When retransmits occur, the system
lowers the MSS to check whether the problem is MTU-related.
If the current MSS is greater than the
configured value to try, it is set to the configured value; otherwise,
the MSS is set to the default values
.Po Va net.inet.tcp.mssdflt
and
.Va net.inet.tcp.v6mssdflt
.Pc .
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va pmtud_blackhole_activated
Number of times configured values were used in an attempt to downshift.
.It Va pmtud_blackhole_activated_min_mss
Number of times default MSS was used in an attempt to downshift.
.It Va pmtud_blackhole_failed
Number of connections for which retransmits continued even after MSS
downshift.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .