.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 13, 2003
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
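.Pp
Wildcard addressing might be used roughly as follows.
This is only a minimal sketch: the helper names
listen_wildcard and local_port are hypothetical and are not part of any
system interface, and passing a port of 0 is assumed to be how the port
is left unspecified so that the system assigns one.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to INADDR_ANY.
 * A port of 0 leaves the choice of port to the system.
 * Returns the socket descriptor, or -1 on failure.
 */
int
listen_wildcard(uint16_t port, int backlog)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all local networks */
	sin.sin_port = htons(port);

	/* bind(2) first, then listen(2) turns the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(s, backlog) < 0) {
		close(s);
		return (-1);
	}
	return (s);
}

/* Hypothetical helper: local port of a bound socket, or 0 on error. */
uint16_t
local_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) < 0)
		return (0);
	return (ntohs(sin.sin_port));
}
```

A server would typically call listen_wildcard once and then loop on
accept(2); local_port is only needed when the system chose the port.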
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_MD5SIG"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option is to enable FreeBSD-based routers
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4 (AF_INET) sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
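.Pp
Setting and testing one of the boolean options above could look like
the following minimal sketch.
The helper names set_tcp_nodelay and get_tcp_nodelay are hypothetical;
the calls themselves are plain setsockopt(2) and getsockopt(2) with the
IPPROTO_TCP option level.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: set (on != 0) or clear (on == 0) TCP_NODELAY.
 * Returns 0 on success, -1 on error with errno set by setsockopt(2).
 */
int
set_tcp_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Hypothetical helper: test TCP_NODELAY.
 * Returns 1 if set, 0 if clear, -1 on error.
 */
int
get_tcp_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return (-1);
	return (val != 0);
}
```

The other boolean options would presumably follow the same pattern with
a different option name.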
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in tenths of a maximal packet.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards-track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards-track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.El
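.Pp
The arithmetic behind two of the variables above can be sketched as
follows.
This is an illustration only, not the kernel's code: both function
names are hypothetical, the retransmit helper assumes that the slop is
added first and the minimum applied afterwards, and any segment size
passed in is the caller's assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert an inflight_stab value, expressed in
 * tenths of a maximal packet, into bytes for a given segment size.
 */
uint32_t
inflight_stab_bytes(uint32_t stab, uint32_t maxseg)
{
	/* Widen before multiplying to avoid 32-bit overflow. */
	return ((uint32_t)(((uint64_t)stab * maxseg) / 10));
}

/*
 * Hypothetical helper: apply rexmit_slop and rexmit_min to a raw
 * retransmit timeout.  All values are in milliseconds.
 */
uint32_t
rexmit_timeout_ms(uint32_t raw, uint32_t slop, uint32_t min)
{
	uint32_t t = raw + slop;

	return (t < min ? min : t);
}
```

With a slop of 200 and a minimum of 1, rexmit_timeout_ms never returns
less than 200, which matches the effective minimum described under
rexmit_min and rexmit_slop.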
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
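.Pp
One of these errors can be provoked deliberately, as in the following
minimal sketch.
The helper names bind_loopback and bound_port are hypothetical; the
sketch assumes that neither socket has had address reuse options set,
so a second bind(2) to a port already in use is expected to report
EADDRINUSE.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: bind a socket to the loopback address and the
 * given port (0 lets the system choose).
 * Returns 0 on success, or the errno value reported by bind(2).
 */
int
bind_loopback(int s, uint16_t port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		return (errno);
	return (0);
}

/* Hypothetical helper: port a socket is bound to, or 0 on error. */
uint16_t
bound_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) < 0)
		return (0);
	return (ntohs(sin.sin_port));
}
```

Binding one socket with port 0, reading the port back with bound_port,
and passing that port to bind_loopback on a second socket should yield
EADDRINUSE on the second call.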
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP \- TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .