.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd August 25, 2005
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
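.Pp
As a minimal sketch of wildcard addressing, the following C fragment binds a
passive socket to
.Dv INADDR_ANY
and, when the port is given as 0, lets the system assign one.
The helper names
.Fn open_wildcard_listener
and
.Fn local_port
are hypothetical and are not part of any system interface.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * A port of 0 asks the system to assign one.
 * Returns the descriptor, or -1 with errno set on failure.
 * (Hypothetical helper, for illustration only.)
 */
int
open_wildcard_listener(unsigned short port, int backlog)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(port);			/* 0: system assigns */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, backlog) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Return the local port of a bound socket in host byte order,
 * or 0 if the address cannot be read.
 */
unsigned short
local_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (0);
	return (ntohs(sin.sin_port));
}
```
.Ed
.Pp
A server would typically call
.Fn open_wildcard_listener
once, report the result of
.Fn local_port
if it asked for port 0, and then loop on
.Xr accept 2 .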
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable such routers
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
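.Pp
The following C fragment is a small sketch of how the
.Dv IPPROTO_TCP
level is used with
.Xr setsockopt 2
and
.Xr getsockopt 2 .
It toggles
.Dv TCP_NODELAY
and reads back
.Dv TCP_MAXSEG ;
the function names are hypothetical and chosen only for this illustration.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Set (on != 0) or clear (on == 0) TCP_NODELAY on socket s.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Return 1 if TCP_NODELAY is set, 0 if clear, -1 on failure. */
int
get_nodelay(int s)
{
	int on = 0;
	socklen_t len = sizeof(on);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, &len) == -1)
		return (-1);
	return (on != 0);
}

/*
 * Return the maximum segment size currently recorded for socket s,
 * or -1 on failure.
 */
int
get_maxseg(int s)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}
```
.Ed
.Pp
The other boolean options in the list above would presumably be handled the
same way, substituting the option name.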
.Pp
Options at the
.Tn IP
level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
.It Va inflight.min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight.max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight.stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight.min
and, if that does not
work, reduce both
.Va inflight.min
and
.Va inflight.stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight.stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.initburst
Control the number of SACK retransmissions done upon initiation of SACK
recovery.
.El
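.Pp
As a hedged sketch, an integer-valued variable from the
.Va net.inet.tcp
branch might be read from a C program with
.Xr sysctlbyname 3
as follows.
The function name
.Fn read_tcp_sysctl_int
is hypothetical, and the
.Dv __FreeBSD__
guard is an assumption added only so that the fragment still compiles on
systems that lack this interface.
.Bd -literal -offset indent
```c
#include <errno.h>
#include <stddef.h>
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Read an integer-valued MIB variable by its full name, for example
 * "net.inet.tcp.mssdflt".  Returns 0 on success and stores the result
 * in *value; returns -1 with errno set on failure.
 * (Hypothetical helper, for illustration only.)
 */
int
read_tcp_sysctl_int(const char *name, int *value)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*value);

	return (sysctlbyname(name, value, &len, NULL, 0));
#else
	/* Assumption: no sysctlbyname(3) here, so report "not supported". */
	(void)name;
	(void)value;
	errno = ENOSYS;
	return (-1);
#endif
}
```
.Ed
.Pp
From the command line, the same variables can be inspected with
.Xr sysctl 8 .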
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
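.Pp
To illustrate how one of these errors surfaces to a program, the sketch below
binds two sockets to the same wildcard address and port and reports the
.Va errno
value from the second
.Xr bind 2 ,
which is expected to be
.Er EADDRINUSE .
The function name
.Fn second_bind_errno
is hypothetical.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Bind one socket to a system-assigned port, then try to bind a second
 * socket to the same address and port.  Returns the errno value from
 * the second bind(2), 0 if it unexpectedly succeeded, or -1 if the
 * setup itself failed.
 */
int
second_bind_errno(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int a, b, error = -1;

	a = socket(AF_INET, SOCK_STREAM, 0);
	b = socket(AF_INET, SOCK_STREAM, 0);
	if (a == -1 || b == -1)
		goto out;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(0);		/* system assigns a port */
	if (bind(a, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    getsockname(a, (struct sockaddr *)&sin, &len) == -1)
		goto out;

	/* sin now names the port already allocated to the first socket. */
	if (bind(b, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		error = errno;
	else
		error = 0;
out:
	if (a != -1)
		close(a);
	if (b != -1)
		close(b);
	return (error);
}
```
.Ed
.Pp
A real program would normally pass the saved value to
.Xr strerror 3
or call
.Xr err 3
instead of returning it.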
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
525