.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd February 15, 2011
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
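.Pp
The passive-socket and wildcard-addressing steps described above can be
sketched in C as follows.
This is a minimal illustration, not a reference implementation: the
function name
.Fn tcp_listen_any
and the backlog of 16 are assumptions made for the example.
Passing a port of 0 asks the system to assign one, which is then read
back with
.Xr getsockname 2 .

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * On entry *port is the requested port (0 lets the system choose);
 * on success *port holds the port actually bound, in host byte order.
 * Returns the descriptor, or -1 on error.
 * The name and the backlog of 16 are illustrative choices.
 */
int
tcp_listen_any(unsigned short *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	assert(port != NULL);
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(*port);			/* 0: system assigns */

	/* bind(2) first, then listen(2) turns the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 16) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}
```

A caller would typically pass the returned descriptor to
.Xr accept 2
in a loop and close it when the service shuts down.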
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr cc 4
for details.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option in a router deployment is to enable
routers running
.Fx
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
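.Pp
As a hedged illustration of the calling convention described above, the
following C sketch sets and tests the boolean
.Dv TCP_NODELAY
option and queries
.Dv TCP_MAXSEG ,
using
.Dv IPPROTO_TCP
as the option level.
The helper names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Enable (on != 0) or disable TCP_NODELAY on s; 0 on success, -1 on error. */
int
tcp_set_nodelay(int s, int on)
{
	assert(s >= 0);
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Return 1 if TCP_NODELAY is set on s, 0 if it is clear, -1 on error. */
int
tcp_get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}

/* Return the maximum segment size currently reported for s, or -1. */
int
tcp_get_maxseg(int s)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}
```

The same pattern applies to the other boolean options in the list, such as
.Dv TCP_NOOPT
and
.Dv TCP_NOPUSH ,
on systems that define them.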
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr cc 4
framework.
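.Pp
To see which congestion control algorithm a given socket is using, the
.Dv TCP_CONGESTION
option can be queried.
The sketch below assumes that the option returns the algorithm name as
a character string, as it does on the systems the editor is aware of;
the helper name
.Fn tcp_get_cc
is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Copy the name of the congestion control algorithm in use on s into
 * buf (buflen bytes) as a NUL-terminated string.
 * Returns 0 on success, -1 on error.
 */
int
tcp_get_cc(int s, char *buf, size_t buflen)
{
	socklen_t len;

	assert(buf != NULL);
	if (buflen < 2)
		return (-1);
	memset(buf, 0, buflen);		/* guarantees termination */
	len = (socklen_t)(buflen - 1);	/* keep room for the NUL */
	return (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len));
}
```

Selecting an algorithm uses
.Xr setsockopt 2
with the same option and the algorithm name as the value; which names
are accepted depends on the modules loaded through
.Xr cc 4 .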
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
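.Pp
The variables above can be read from a program with
.Xr sysctlbyname 3 .
The following C sketch is an illustration under stated assumptions: the
helper names are hypothetical, the variables are assumed to be plain
integers, and the drop-time estimate simply adds one
.Va keepintvl
per unanswered probe to
.Va keepidle ,
following the description of
.Dv TCPTV_KEEPCNT
above.
The actual kernel timing may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

#define	KEEPCNT_DEFAULT	8	/* TCPTV_KEEPCNT default quoted in the text */

/*
 * Read an integer variable such as "net.inet.tcp.keepidle" into *val.
 * Returns 0 on success, -1 on failure or on systems where this sketch
 * does not use sysctlbyname(3).
 */
int
tcp_mib_int(const char *name, int *val)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*val);

	return (sysctlbyname(name, val, &len, NULL, 0));
#else
	(void)name;
	(void)val;
	return (-1);
#endif
}

/*
 * Estimate, in milliseconds, how long an idle connection to an
 * unresponsive peer survives: the idle period before the first probe
 * plus one probe interval for each unanswered probe.
 */
long
tcp_keepalive_drop_ms(long keepidle, long keepintvl, int keepcnt)
{
	assert(keepidle >= 0 && keepintvl >= 0 && keepcnt >= 0);
	return (keepidle + (long)keepcnt * keepintvl);
}
```

For example, with sample values of 7200000 ms for
.Va keepidle
and 75000 ms for
.Va keepintvl ,
the estimate is 7200000 + 8 * 75000 = 7800000 ms, or 130 minutes.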
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
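.Pp
Two of the errors above are easy to provoke on the loopback interface,
which makes them convenient for checking error-handling code.
The C sketch below is only an illustration; the function names are
hypothetical, and it assumes that the loopback address is usable and
that a blocking
.Xr connect 2
to a listening socket completes before
.Xr accept 2
is called.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a listening socket on a system-assigned loopback port and
 * store its address in *sin.  Returns the descriptor, or -1 on error.
 */
static int
listen_loopback(struct sockaddr_in *sin)
{
	socklen_t len = sizeof(*sin);
	int s;

	assert(sin != NULL);
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin->sin_port = 0;			/* system assigns a port */
	if (bind(s, (struct sockaddr *)sin, sizeof(*sin)) == -1 ||
	    listen(s, 1) == -1 ||
	    getsockname(s, (struct sockaddr *)sin, &len) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/*
 * Bind a second socket to a port that is already in use and return
 * the resulting errno (EADDRINUSE is expected).  Returns 0 if the
 * bind unexpectedly succeeds, -1 if the setup fails.
 */
int
errno_second_bind(void)
{
	struct sockaddr_in sin;
	int l, s, err = -1;

	if ((l = listen_loopback(&sin)) == -1)
		return (-1);
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) != -1) {
		err = bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ?
		    errno : 0;
		close(s);
	}
	close(l);
	return (err);
}

/*
 * Connect a socket, then call connect(2) on it again and return the
 * resulting errno (EISCONN is expected).  Returns 0 if the second
 * connect unexpectedly succeeds, -1 if the setup fails.
 */
int
errno_second_connect(void)
{
	struct sockaddr_in sin;
	int l, s, err = -1;

	if ((l = listen_loopback(&sin)) == -1)
		return (-1);
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) != -1) {
		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
			err = connect(s, (struct sockaddr *)&sin,
			    sizeof(sin)) == -1 ? errno : 0;
		close(s);
	}
	close(l);
	return (err);
}
```

Real programs would normally report such failures with
.Xr strerror 3
or
.Xr err 3
rather than returning the raw error number.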
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr cc 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .