.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd May 9, 2018
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_FUNCTION_BLK"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CCALGOOPT
Set or query congestion control algorithm specific parameters.
See
.Xr mod_cc 4
for details.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_FUNCTION_BLK
Select or query the set of functions that TCP will use for this connection.
This allows a user to select an alternate TCP stack.
The alternate TCP stack must already be loaded in the kernel.
To list the available TCP stacks, see
.Va functions_available
in the
.Sx MIB Variables
section further down.
To list the default TCP stack, see
.Va functions_default
in the
.Sx MIB Variables
section.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified.
When this option is enabled on a socket, all inbound and outgoing
TCP segments must be signed with MD5 digests.
.Pp
One common use for this in a
.Fx
router deployment is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry can only be specified on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination,
the system does not send any outgoing segments and drops any inbound segments.
.Pp
Each dropped segment is taken into account in the TCP protocol statistics.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va initcwnd_segments
Enable the ability to specify initial congestion window in number of segments.
The default value is 10 as suggested by RFC 6928.
Changing the value on the fly does not affect connections using the
congestion window from the hostcache.
Caution:
This regulates the burst of packets allowed to be sent in the first RTT.
The value should be relative to the link capacity.
Start with small values for lower-capacity links.
Large bursts can cause buffer overruns and packet drops if routers have small
buffers or the link is experiencing congestion.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
Settings:
.Bl -tag -compact
.It 0
Disable ECN.
.It 1
Allow incoming connections to request ECN.
Outgoing connections will request ECN.
.It 2
Allow incoming connections to request ECN.
Outgoing connections will not request ECN.
.El
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Turn on automatic path MTU blackhole detection.
In case of retransmits, the OS will
lower the MSS to check whether it is an MTU problem.
If the current MSS is greater than the
configured value to try, it will be set to the configured value; otherwise,
the MSS will be set to the default values
.Po Va net.inet.tcp.mssdflt
and
.Va net.inet.tcp.v6mssdflt
.Pc .
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va pmtud_blackhole_activated
Number of times configured values were used in an attempt to downshift.
.It Va pmtud_blackhole_activated_min_mss
Number of times default MSS was used in an attempt to downshift.
.It Va pmtud_blackhole_failed
Number of connections for which retransmits continued even after MSS
downshift.
.It Va functions_available
List of available TCP function blocks (TCP stacks).
.It Va functions_default
The default TCP function block (TCP stack).
.It Va functions_inherit_listen_socket_stack
Determines whether to inherit the listen socket's TCP stack or use the current
system default TCP stack, as defined by
.Va functions_default .
Default is true.
.It Va insecure_rst
Use criteria defined in RFC793 instead of RFC5961 for accepting RST segments.
Default is false.
.It Va insecure_syn
Use criteria defined in RFC793 instead of RFC5961 for accepting SYN segments.
Default is false.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address;
.It Bq Er EINVAL
when trying to change TCP function blocks at an invalid point in the session;
.It Bq Er ENOENT
when trying to use a TCP function block that is not available.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8 ,
.Xr tcp_functions 9
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
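.Sh EXAMPLES
The per-socket keepalive options described above
.Po Dv TCP_KEEPIDLE ,
.Dv TCP_KEEPINTVL
and
.Dv TCP_KEEPCNT
.Pc
might be set and read back roughly as follows.
This is a minimal sketch, not a reference implementation.
The helper names tcp_set_uint, tcp_get_uint and tcp_tune_keepalive are
hypothetical and do not exist in any system header.
The sketch also assumes that keepalive probes are enabled per socket with
.Dv SO_KEEPALIVE
at the
.Dv SOL_SOCKET
level, and it spells the
.Vt u_int
argument as unsigned int.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>

/* Hypothetical helper: set one unsigned TCP-level option; 0 on success. */
static int
tcp_set_uint(int s, int opt, unsigned int val)
{
    return (setsockopt(s, IPPROTO_TCP, opt, &val, sizeof(val)));
}

/* Hypothetical helper: read one unsigned TCP-level option into *val. */
static int
tcp_get_uint(int s, int opt, unsigned int *val)
{
    socklen_t len = sizeof(*val);

    return (getsockopt(s, IPPROTO_TCP, opt, val, &len));
}

/*
 * Hypothetical helper: enable keepalive probes on socket s and tune them
 * for this socket only.  idle and intvl are in seconds (the sysctl
 * defaults are in milliseconds); cnt is the number of unanswered probes
 * tolerated before the connection is dropped.
 * Returns 0 on success, -1 on the first failing call (errno is set).
 */
static int
tcp_tune_keepalive(int s, unsigned int idle, unsigned int intvl,
    unsigned int cnt)
{
    int on = 1;

    assert(s >= 0);
    /* Assumption: probes are only sent when SO_KEEPALIVE is set. */
    if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
        return (-1);
    if (tcp_set_uint(s, TCP_KEEPIDLE, idle) == -1)
        return (-1);
    if (tcp_set_uint(s, TCP_KEEPINTVL, intvl) == -1)
        return (-1);
    if (tcp_set_uint(s, TCP_KEEPCNT, cnt) == -1)
        return (-1);
    return (0);
}
```

A caller would typically create a socket with socket(AF_INET, SOCK_STREAM, 0)
and then call, for example, tcp_tune_keepalive(s, 120, 15, 4); these values
are arbitrary and chosen only for illustration.
Setting the options on a listening socket lets accepted sockets inherit
them, as described for each option above.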