.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd July 10, 2004
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
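.Pp
The active and passive roles described above can be sketched in C as follows.
This is a minimal illustration under stated assumptions, not part of the
system interface:
the helper names
.Fn open_passive
and
.Fn open_active
are hypothetical, and the loopback address and a system-assigned port are
used only to keep the example self-contained.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket on the loopback address.
 * The port chosen by the system is stored in *port_out.
 * Returns the descriptor, or -1 on error.
 */
int
open_passive(int *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);	/* created active */
	if (fd < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(0);		/* system assigns a port */
	/* bind(2) first; listen(2) then makes the socket passive. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(fd, 5) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
		close(fd);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (fd);
}

/*
 * Hypothetical helper: create an active socket and connect it to the
 * given loopback port.  Returns the descriptor, or -1 on error.
 */
int
open_active(int port)
{
	struct sockaddr_in sin;
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons((unsigned short)port);
	if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```
.Ed
.Pp
A server would pass the descriptor returned by the first helper to
.Xr accept 2 ,
while a client would use the descriptor returned by the second helper
directly for reading and writing.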
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
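.Pp
Wildcard addressing can be illustrated with a short C sketch.
The helper names
.Fn listen_wildcard
and
.Fn local_address
are hypothetical and exist only for this example; a port of zero is passed
on the assumption that a system-assigned port is acceptable.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: listen on all networks by binding INADDR_ANY.
 * The port assigned by the system is stored in *port_out.
 * Returns the descriptor, or -1 on error.
 */
int
listen_wildcard(int *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(0);			/* port not specified */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(fd, 5) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
		close(fd);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (fd);
}

/*
 * Hypothetical helper: write the local IPv4 address of a socket into
 * buf in dotted-quad form.  Returns 0 on success, or -1 on error.
 */
int
local_address(int fd, char *buf, size_t buflen)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(fd, (struct sockaddr *)&sin, &len) < 0)
		return (-1);
	if (inet_ntop(AF_INET, &sin.sin_addr, buf, (socklen_t)buflen) == NULL)
		return (-1);
	return (0);
}
```
.Ed
.Pp
Calling the second helper on the listening socket reports the wildcard
address, whereas calling it on a descriptor returned by
.Xr accept 2
reports the address of the interface on which the connection arrived.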
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
.Tn TCP
(see
.Xr ttcp 4 ) .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
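.Pp
As a brief sketch of how the option level and option names fit together,
the following C fragment sets and reads back
.Dv TCP_NODELAY
and queries
.Dv TCP_MAXSEG .
The helper names are hypothetical and chosen only for illustration;
error handling is reduced to returning \-1.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper: enable (on != 0) or disable TCP_NODELAY. */
int
set_nodelay(int fd, int on)
{
	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Hypothetical helper: return 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int
get_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return (-1);
	return (val != 0);
}

/* Hypothetical helper: return the current maximum segment size, or -1. */
int
get_maxseg(int fd)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) < 0)
		return (-1);
	return (mss);
}
```
.Ed
.Pp
The value reported for
.Dv TCP_MAXSEG
on a socket that is not yet connected is a system default and may change
once a connection has been established.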
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1644"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_DO_RFC1644
.Pq Va rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay each ACK in an attempt to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight_enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight_debug
Enable debugging for the bandwidth-delay product algorithm.
This may
default to on (1), so if you enable the algorithm,
you should probably also
disable debugging by setting this variable to 0.
.It Va inflight_min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight_max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight_stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight_min
and, if that does not
work, reduce both
.Va inflight_min
and
.Va inflight_stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight_stab
can lead to upwards of 20% underutilization of the link
and reduces the algorithm's ability to adapt to changing
situations, so it should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It
helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
This is a standards track RFC
and is off by default.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
This is a standards track RFC and
support for it is off by default.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.El
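.Pp
The units of
.Va inflight_stab
described above (tenths of a maximal packet) can be made concrete with a
small worked calculation in C.
This is only an arithmetic sketch: the helper name is hypothetical, and
treating a
.Dq "maximal packet"
as a caller-supplied segment size in bytes is an assumption; the kernel's
actual computation may differ.
.Bd -literal
```c
/*
 * Hypothetical helper: extra stability window, in bytes, implied by an
 * inflight_stab setting.  The setting is expressed in units of
 * (maximal packets / 10); maxseg is the assumed maximal packet size.
 */
long
inflight_stab_bytes(int stab, int maxseg)
{
	return ((long)stab * (long)maxseg / 10);
}
```
.Ed
.Pp
With an assumed segment size of 1460 bytes, the default setting of 20
corresponds to two segments, or 2920 bytes, and the lowest recommended
setting of 5 corresponds to half a segment, or 730 bytes.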
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening on the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
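.Pp
A program may wish to translate these error codes into diagnostics.
The following C sketch shows one way to do so; the function name
.Fn tcp_error_reason
is hypothetical, and the message strings are paraphrases of the list above
rather than text produced by the system.
.Bd -literal
```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: map an errno value from a TCP socket operation
 * to a short explanation.  Codes not in the list above fall back to
 * strerror(3).
 */
const char *
tcp_error_reason(int error)
{
	switch (error) {
	case EISCONN:
		return ("socket already has a connection");
	case ENOBUFS:
		return ("system out of memory for an internal data structure");
	case ETIMEDOUT:
		return ("connection dropped due to excessive retransmissions");
	case ECONNRESET:
		return ("remote peer forced the connection closed");
	case ECONNREFUSED:
		return ("remote peer refused connection establishment");
	case EADDRINUSE:
		return ("port has already been allocated");
	case EADDRNOTAVAIL:
		return ("no network interface has that address");
	case EAFNOSUPPORT:
		return ("multicast address not usable with TCP");
	default:
		return (strerror(error));
	}
}
```
.Ed
.Pp
A typical caller would invoke the function with the value of
.Va errno
immediately after a failed
.Xr connect 2
or
.Xr bind 2
call.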
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr ttcp 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "R. Braden"
.%T "T/TCP - TCP Extensions for Transactions"
.%O "RFC 1644"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
552.Bx 4.4 .
553