.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd November 14, 2011
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
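.Pp
The following is a minimal sketch of the wildcard addressing described
above, not a definitive implementation.
The function name
.Fn listen_any
and the backlog of 5 are assumptions made for illustration.
It binds
.Dv INADDR_ANY ,
makes the socket passive with
.Xr listen 2 ,
and reports the port the system assigned when the caller passes 0:
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any system API): create a passive
 * TCP socket bound to the wildcard address.  If *port is 0 the system
 * assigns a port; the port actually in use is written back to *port in
 * host byte order.  Returns the listening descriptor, or -1 on failure.
 */
int
listen_any(unsigned short *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(*port);			/* 0: system picks */

	/* bind(2) first; listen(2) then makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}
```
.Ed
.Pp
A caller would typically pass the returned descriptor to
.Xr accept 2 ;
calling
.Xr getsockname 2
on an accepted socket then shows the local address fixed for that
connection.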
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use of this option is to enable
.Fx
based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
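.Pp
As an illustration of setting and testing one of the boolean options
above, the sketch below toggles
.Dv TCP_NODELAY
at the
.Dv IPPROTO_TCP
level.
The helper names
.Fn set_nodelay
and
.Fn get_nodelay
are hypothetical and exist only for this example:
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable (on != 0) or disable (on == 0)
 * TCP_NODELAY on socket s.  Returns 0 on success, -1 on error.
 */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Hypothetical helper: query TCP_NODELAY on socket s.
 * Returns 1 if the option is set, 0 if it is clear, -1 on error.
 */
int
get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```
.Ed
.Pp
The other boolean options can presumably be handled the same way by
substituting the option name, provided the option is available on the
system in question.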
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
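.Pp
The variables above can be read from a program as well as from the
command line.
The sketch below assumes the variable being read is an integer and uses
.Xr sysctlbyname 3 ;
the helper name
.Fn read_int_mib
is hypothetical, and on systems other than
.Fx
the sketch deliberately fails with
.Er ENOSYS
rather than guessing at an equivalent interface:
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical helper: read an integer-valued MIB variable by name,
 * for example "net.inet.tcp.mssdflt".  Returns 0 and stores the value
 * in *out, or returns -1 with errno set.  Assumes the variable is an
 * int; outside FreeBSD this sketch always fails with ENOSYS.
 */
int
read_int_mib(const char *name, int *out)
{
#ifdef __FreeBSD__
	size_t len = sizeof(*out);

	/* No new value is supplied, so this is a read-only query. */
	return (sysctlbyname(name, out, &len, NULL, 0));
#else
	(void)name;
	(void)out;
	errno = ENOSYS;
	return (-1);
#endif
}
```
.Ed
.Pp
Variables described above as read-only cannot be changed this way;
writing any variable would require passing a new value and normally
superuser privileges.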
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
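.Pp
To make one of these errors concrete, the sketch below provokes
.Er ECONNREFUSED
by connecting to a loopback port that is bound but not listening.
The helper name
.Fn connect_errno
is hypothetical, and the result assumes the default configuration: with
.Xr blackhole 4
enabled no RST is sent, so a different error would be expected.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind one socket to a system-assigned loopback
 * port without calling listen(2), then connect(2) to that port from a
 * second socket.  Returns 0 if the connection unexpectedly succeeds,
 * the errno value from connect(2) if it fails, or -1 if the setup
 * itself fails.
 */
int
connect_errno(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int holder, client, error = -1;

	holder = socket(AF_INET, SOCK_STREAM, 0);
	client = socket(AF_INET, SOCK_STREAM, 0);
	if (holder == -1 || client == -1)
		goto out;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;		/* let the system pick a port */

	/* Reserve the port, but never make the socket passive. */
	if (bind(holder, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    getsockname(holder, (struct sockaddr *)&sin, &len) == -1)
		goto out;

	/* Save errno immediately, before close(2) can change it. */
	if (connect(client, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		error = errno;
	else
		error = 0;
out:
	if (holder != -1)
		close(holder);
	if (client != -1)
		close(client);
	return (error);
}
```
.Ed
.Pp
The returned value can be compared against the constants listed above
or passed to
.Xr strerror 3
for display.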
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
534.Em subject to change .
535