.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 7, 2012
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This write-only
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
.Fx
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
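.Sh EXAMPLES
The following is a minimal sketch, not a definitive implementation, of the
wildcard addressing and socket option usage described in
.Sx DESCRIPTION .
The helper names
.Fn tcp_listen_any
and
.Fn tcp_tune ,
the listen backlog, and the keepalive values are assumptions made for
illustration only; they are not part of any system interface.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to the
 * wildcard address INADDR_ANY.  Passing port 0 leaves the port
 * unspecified, so the system assigns one.
 * Returns the socket descriptor, or -1 on error.
 */
static int
tcp_listen_any(in_port_t port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	/* bind(2) first, then listen(2) to make the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Hypothetical helper: set TCP_NODELAY and per-socket keepalive
 * timing.  The option level is IPPROTO_TCP for the TCP options.
 * The values below (seconds, and a probe count) are illustrative
 * assumptions, not recommended settings.
 * Returns 0 on success, or -1 on error.
 */
static int
tcp_tune(int s)
{
	int on = 1;
	unsigned int idle = 600, intvl = 30, cnt = 4;

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	/* Keepalive probes are only sent if SO_KEEPALIVE is enabled. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
	    sizeof(cnt)) == -1)
		return (-1);
	return (0);
}
```
.Ed
.Pp
A caller might pass a port of 0 to
.Fn tcp_listen_any ,
use
.Xr getsockname 2
to learn which port the system assigned, and then call
.Fn tcp_tune
on the listening socket so that the keepalive values are inherited by
sockets returned from
.Xr accept 2 .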