.\" Copyright (c) 1983, 1991, 1993
.\"     The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd August 16, 2008
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
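.Pp
As a minimal sketch of the calling convention described above, the following
C fragment sets and reads back the boolean option
.Dv TCP_NODELAY
at the
.Dv IPPROTO_TCP
level.
The helper names
.Fn set_tcp_nodelay
and
.Fn get_tcp_nodelay
are hypothetical, chosen for illustration only; they are not part of any
system interface, and error reporting is left to the caller.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper (not a system API): set or clear TCP_NODELAY
 * on a TCP socket.  Returns 0 on success or -1 with errno set,
 * exactly as setsockopt(2) does.
 */
int
set_tcp_nodelay(int fd, int on)
{
	int val = on ? 1 : 0;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)));
}

/*
 * Hypothetical helper: read TCP_NODELAY back.  Returns 1 if the
 * option is set, 0 if it is clear, and -1 if getsockopt(2) fails.
 */
int
get_tcp_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```
.Ed
.Pp
A caller would typically create the socket with
.Fn socket AF_INET SOCK_STREAM 0
and enable the option before writing latency-sensitive data.
The other boolean options listed above should follow the same pattern with a
different option name.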
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
.It Va inflight.min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight.max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight.stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight.min
and, if that does not
work, reduce both
.Va inflight.min
and
.Va inflight.stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight.stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .