.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 13, 2003
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
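.Pp
The active and passive roles described above can be sketched in C as
follows.
This is only an illustration under stated assumptions: the helper name
tcp_loopback_roundtrip is hypothetical, the loopback address is used so
that both ends live in one process, and error handling is reduced to a
single failure return.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any system API): make a passive
 * socket on 127.0.0.1 with a system-chosen port, connect an active
 * socket to it, and pass one byte across.
 * Returns 0 on success, -1 on any failure.
 */
int
tcp_loopback_roundtrip(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int lfd = -1, cfd = -1, afd = -1, rc = -1;
	char out = 'x', in = 0;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;		/* let the system pick a port */

	/* Passive side: bind(2) first, then listen(2). */
	if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		goto done;
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto done;
	if (listen(lfd, 1) == -1)
		goto done;
	/* Learn which port was assigned so the active side can reach it. */
	if (getsockname(lfd, (struct sockaddr *)&sin, &len) == -1)
		goto done;

	/* Active side: a newly created socket is active and may connect(2). */
	if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		goto done;
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto done;

	/* Only the passive socket may accept(2) the incoming connection. */
	if ((afd = accept(lfd, NULL, NULL)) == -1)
		goto done;

	/* The connection is a byte stream: send one byte, read it back. */
	if (write(cfd, &out, 1) != 1 || read(afd, &in, 1) != 1)
		goto done;
	rc = (in == out) ? 0 : -1;
done:
	if (afd != -1)
		close(afd);
	if (cfd != -1)
		close(cfd);
	if (lfd != -1)
		close(lfd);
	return (rc);
}
```

A caller would simply check that tcp_loopback_roundtrip() returns 0.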
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
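.Pp
Wildcard addressing with a system-assigned port might look like the
following sketch.
The helper name tcp_listen_any and the backlog of 5 are assumptions made
for the example; only the use of INADDR_ANY and an unspecified (zero)
port comes from the text above.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket bound to INADDR_ANY
 * with no port specified, and report the port the system assigned
 * (host byte order) through *portp.
 * Returns the listening descriptor, or -1 on failure.
 */
int
tcp_listen_any(unsigned short *portp)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd;

	if ((fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = 0;				/* system assigns */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(fd, 5) == -1 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) == -1) {
		close(fd);
		return (-1);
	}
	*portp = ntohs(sin.sin_port);
	return (fd);
}
```

Calling getsockname(2) on a descriptor returned by accept(2) would then
show the specific local address chosen for that connection.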
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
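.Pp
As a rough illustration of the option level and two of the options
listed above, the sketch below sets TCP_NODELAY and reads TCP_MAXSEG.
The helper names tcp_set_nodelay and tcp_get_maxseg are hypothetical,
and the value reported for an unconnected socket is system dependent.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/*
 * Hypothetical helper: turn on TCP_NODELAY and read the option back.
 * Returns 1 if the option reads back as enabled, 0 if it reads back
 * as disabled, and -1 if either system call fails.
 */
int
tcp_set_nodelay(int fd)
{
	int on = 1, val = 0;
	socklen_t len = sizeof(val);

	/* The option level is IPPROTO_TCP, not SOL_SOCKET. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}

/*
 * Hypothetical helper: report the maximum segment size currently
 * associated with the socket, or -1 on failure.
 */
int
tcp_get_maxseg(int fd)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}
```

Both helpers take any descriptor obtained from socket(AF_INET,
SOCK_STREAM, 0); on a connected socket tcp_get_maxseg reflects the
negotiated value.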
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel
periodically sends a packet to the remote host to verify that the
connection is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.El
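.Pp
Integer-valued variables in this branch can be read from a program by
name.
The sketch below assumes that sysctlbyname(3) is available, as it is on
FreeBSD; the helper name tcp_sysctl_int is hypothetical, and on other
systems the helper simply fails with ENOSYS rather than guessing at an
equivalent interface.

```c
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: read one integer MIB variable, for example
 * "net.inet.tcp.mssdflt", into *valp.
 * Returns 0 on success and -1 with errno set on failure.
 */
int
tcp_sysctl_int(const char *name, int *valp)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*valp);

	/* newp is NULL and newlen is 0: this only reads the value. */
	return (sysctlbyname(name, valp, &len, NULL, 0));
#else
	/* Assumption: no sysctlbyname(3) here, so report "unsupported". */
	(void)name;
	(void)valp;
	errno = ENOSYS;
	return (-1);
#endif
}
```

For instance, tcp_sysctl_int("net.inet.tcp.mssdflt", &mss) would store
the default maximum segment size in mss when the call returns 0.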
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
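.Pp
Two of the errors listed above are easy to provoke over the loopback
interface, which the following sketch uses to show how they surface
through errno.
All three function names are hypothetical, and the ECONNREFUSED case
assumes that the peer answers with a reset, that is, that the blackhole
variable described under MIB Variables is not enabled.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a TCP socket to 127.0.0.1 on a
 * system-chosen port, optionally listen(2) on it, and report the
 * resulting address in *sin.  Returns the descriptor, or -1.
 */
static int
bound_loopback(int do_listen, struct sockaddr_in *sin)
{
	socklen_t len = sizeof(*sin);
	int fd;

	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if ((fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);
	if (bind(fd, (struct sockaddr *)sin, sizeof(*sin)) == -1 ||
	    (do_listen && listen(fd, 1) == -1) ||
	    getsockname(fd, (struct sockaddr *)sin, &len) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}

/*
 * Connect twice on one socket.  Returns the errno of the second
 * connect(2), expected to be EISCONN, or 0 if the setup failed.
 */
int
tcp_errno_second_connect(void)
{
	struct sockaddr_in sin;
	int lfd, cfd, err = 0;

	if ((lfd = bound_loopback(1, &sin)) == -1)
		return (0);
	if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) != -1) {
		if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == 0 &&
		    connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
			err = errno;	/* save before close(2) can change it */
		close(cfd);
	}
	close(lfd);
	return (err);
}

/*
 * Connect to a port that is bound but has no listener.  Returns the
 * errno of connect(2), expected to be ECONNREFUSED, or 0 if the
 * setup failed or the connect unexpectedly succeeded.
 */
int
tcp_errno_no_listener(void)
{
	struct sockaddr_in sin;
	int bfd, cfd, err = 0;

	if ((bfd = bound_loopback(0, &sin)) == -1)
		return (0);
	if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) != -1) {
		if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
			err = errno;
		close(cfd);
	}
	close(bfd);
	return (err);
}
```

The returned values can be compared against the constants from
errno.h or passed to strerror(3) for display.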
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP \- TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .