.\" Copyright (c) 1983, 1991, 1993
.\"     The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd May 19, 2016
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CCALGOOPT
Set or query congestion control algorithm specific parameters.
See
.Xr mod_cc 4
for details.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
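.Pp
As an illustration of the per-socket options described above, the following
is a minimal sketch rather than a definitive recipe.
It uses the Python standard
.Ql socket
module, which hands its arguments to
.Xr setsockopt 2
and
.Xr getsockopt 2
at the
.Dv IPPROTO_TCP
level.
The choice of Python, the helper name
.Fn tune_socket ,
and the keepalive values are assumptions made for this example only.
Options that the local platform does not export are skipped.
.Bd -literal
```python
import socket

# Keepalive settings in seconds (the unit the per-socket options take).
# The values are illustrative assumptions, not recommendations.
KEEPALIVE_SECONDS = {
    "TCP_KEEPIDLE": 60,   # idle time before the first probe
    "TCP_KEEPINTVL": 10,  # interval between unanswered probes
    "TCP_KEEPCNT": 5,     # unanswered probes before the drop
}


def tune_socket(sock):
    """Apply TCP_NODELAY and keepalive options; return the values read back."""
    # Keepalive probes are sent only when SO_KEEPALIVE is enabled.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Boolean option: a non-zero value disables the small-write gathering.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    applied = {
        "TCP_NODELAY": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    }
    for name, seconds in KEEPALIVE_SECONDS.items():
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform's socket module does not export it
        sock.setsockopt(socket.IPPROTO_TCP, option, seconds)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        for name, value in sorted(tune_socket(s).items()):
            print(name, value)
```
.Ed
.Pp
Reading each option back with
.Xr getsockopt 2
is a cheap way to confirm that the kernel accepted the requested value.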
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va initcwnd_segments
Enable the ability to specify initial congestion window in number of segments.
The default value is 10 as suggested by RFC 6928.
Changing the value on the fly does not affect connections using a congestion
window from the hostcache.
Caution:
This regulates the burst of packets allowed to be sent in the first RTT.
The value should be relative to the link capacity.
Start with small values for lower-capacity links.
Large bursts can cause buffer overruns and packet drops if routers have small
buffers or the link is experiencing congestion.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
Settings:
.Bl -tag -compact
.It 0
Disable ECN.
.It 1
Allow incoming connections to request ECN.
Outgoing connections will request ECN.
.It 2
Allow incoming connections to request ECN.
Outgoing connections will not request ECN.
.El
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Turn on automatic path MTU blackhole detection.
In case of retransmits, the OS will
lower the MSS to check whether it is an MTU problem.
If the current MSS is greater than the
configured value to try, it will be set to the configured value; otherwise,
the MSS will be set to the default values
.Po Va net.inet.tcp.mssdflt
and
.Va net.inet.tcp.v6mssdflt
.Pc .
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va pmtud_blackhole_activated
Number of times configured values were used in an attempt to downshift.
.It Va pmtud_blackhole_activated_min_mss
Number of times default MSS was used in an attempt to downshift.
.It Va pmtud_blackhole_failed
Number of connections for which retransmits continued even after MSS
downshift.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
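.Sh EXAMPLES
The keepalive defaults listed under
.Sx MIB Variables
are given in milliseconds, while the per-socket options take seconds.
The following sketch works through that arithmetic; it is an
approximation that assumes every probe goes unanswered, and the exact
timing in the kernel may differ slightly.
The function names are hypothetical and exist only for this example,
and Python is used purely for illustration.
.Bd -literal
```python
def keepalive_drop_time_ms(keepidle_ms=7200000, keepintvl_ms=75000, keepcnt=8):
    """Approximate idle time, in ms, before an unresponsive peer is dropped."""
    # One idle period, then keepcnt probe intervals with no response.
    return keepidle_ms + keepcnt * keepintvl_ms


def ms_to_sockopt_seconds(ms):
    """Convert a MIB value in milliseconds to whole seconds, rounding up."""
    return (ms + 999) // 1000


if __name__ == "__main__":
    total = keepalive_drop_time_ms()
    print(total, "ms =", total / 60000, "minutes")
    print(ms_to_sockopt_seconds(75000), "seconds")
```
.Ed
.Pp
With the documented defaults this gives 7800000 msec, or 130 minutes,
between the last traffic and the connection being dropped.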