.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd March 7, 2012
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
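.Pp
As an illustration of wildcard addressing, the following minimal sketch
creates a passive socket bound to
.Dv INADDR_ANY .
The helper names
.Fn open_wildcard_listener
and
.Fn bound_port
are hypothetical and are not part of any system interface.
Passing a port of 0 is assumed to leave the port unspecified,
so that the system assigns one.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * The port is in host byte order; 0 leaves the choice to the system.
 * Returns the listening descriptor, or -1 with errno set.
 */
int
open_wildcard_listener(unsigned short port, int backlog)
{
    struct sockaddr_in sin;
    int s, saved;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s == -1)
        return (-1);

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);    /* wildcard address */
    sin.sin_port = htons(port);

    /* bind(2) first, then listen(2) makes the socket passive. */
    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
        listen(s, backlog) == -1) {
        saved = errno;          /* keep the original error across close */
        close(s);
        errno = saved;
        return (-1);
    }
    return (s);
}

/* Return the local port of s in host byte order, or -1 on error. */
int
bound_port(int s)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);

    if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
        return (-1);
    return (ntohs(sin.sin_port));
}
```
.Ed
.Pp
A server would typically pass the returned descriptor to
.Xr accept 2 ;
.Fn bound_port
is only needed when the port was left for the system to choose.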
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This write-only
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option is to enable routers running
.Fx
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
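.Pp
The following is a minimal sketch of how some of the options above might be
used from a program.
The helper names
.Fn tcp_set_nodelay ,
.Fn tcp_get_nodelay
and
.Fn tcp_tune_keepalive
are hypothetical.
The sketch assumes that keepalive probes are enabled per socket with
.Dv SO_KEEPALIVE ,
and it only sets the keepalive timers because they are described above as
write-only.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Set or clear the boolean TCP_NODELAY option on s. */
int
tcp_set_nodelay(int s, int on)
{
    return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Read TCP_NODELAY back: 1 if set, 0 if clear, -1 on error. */
int
tcp_get_nodelay(int s)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, &len) == -1)
        return (-1);
    return (on != 0);
}

/*
 * Enable keepalive probes on s and set the per-socket timers.
 * idle and intvl are in seconds; cnt is a number of probes.
 * Returns 0 on success, or -1 with errno set by the failing call.
 */
int
tcp_tune_keepalive(int s, unsigned int idle, unsigned int intvl,
    unsigned int cnt)
{
    int on = 1;

    if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
        return (-1);
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
        sizeof(idle)) == -1)
        return (-1);
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
        sizeof(intvl)) == -1)
        return (-1);
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
        sizeof(cnt)) == -1)
        return (-1);
    return (0);
}
```
.Ed
.Pp
For instance,
.Fn tcp_tune_keepalive
called with 60, 10 and 5 would request a first probe after one minute of
idle time, then up to five probes ten seconds apart.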
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
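.Pp
As a sketch of how the algorithm in use might be queried or selected with the
.Dv TCP_CONGESTION
option, consider the helpers below.
The names
.Fn tcp_get_cc
and
.Fn tcp_set_cc
are hypothetical, and the sketch assumes that the option value is the
algorithm name as a character string; see
.Xr mod_cc 4
for the authoritative description.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

/*
 * Copy the name of the congestion control algorithm used by s
 * into buf.  Returns 0 on success, -1 on error.
 */
int
tcp_get_cc(int s, char *buf, size_t buflen)
{
    socklen_t len;

    if (buf == NULL || buflen == 0)
        return (-1);
    memset(buf, 0, buflen);             /* guarantees NUL termination */
    len = (socklen_t)(buflen - 1);      /* keep the last byte as NUL */
    return (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len));
}

/* Ask for the named algorithm on s.  Returns 0 on success, -1 on error. */
int
tcp_set_cc(int s, const char *name)
{
    return (setsockopt(s, IPPROTO_TCP, TCP_CONGESTION, name,
        (socklen_t)strlen(name)));
}
```
.Ed
.Pp
A request for an algorithm that has not been made available is expected to
fail, so callers should check the return value of
.Fn tcp_set_cc .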
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
A value of 2 logs any
.Tn TCP
packets sent to closed ports.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel
periodically sends a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress the creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
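.Pp
The keepalive variables above combine into an approximate upper bound on how
long a dead peer can go unnoticed.
The sketch below works this out under the assumption that a connection is
dropped once
.Va keepcnt
probes, spaced
.Va keepintvl
apart, have gone unanswered after
.Va keepidle
of idle time.
The function and macro names are hypothetical; with the documented defaults
the result is 7800000 msec, or 2 hours and 10 minutes.
.Bd -literal -offset indent
```c
/* Defaults quoted in the variable list, in milliseconds and packets. */
#define KEEPIDLE_DEFAULT_MS     7200000L
#define KEEPINTVL_DEFAULT_MS    75000L
#define KEEPCNT_DEFAULT         8L

/*
 * Approximate time, in milliseconds, from the last traffic on a
 * connection until an unresponsive peer causes it to be dropped:
 * the idle period plus keepcnt probe intervals.
 */
long
keepalive_drop_ms(long keepidle_ms, long keepintvl_ms, long keepcnt)
{
    return (keepidle_ms + keepintvl_ms * keepcnt);
}

/*
 * The per-socket options take seconds while the MIB variables are
 * in milliseconds; convert, rounding up so a timer is never shortened.
 */
long
ms_to_seconds(long ms)
{
    return ((ms + 999) / 1000);
}
```
.Ed
.Pp
The conversion helper reflects that the per-socket options
.Dv TCP_KEEPIDLE
and
.Dv TCP_KEEPINTVL
are given in seconds, while the corresponding MIB variables are in
milliseconds.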
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
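.Pp
As one illustration of these errors, the following sketch binds two sockets to
the same loopback port and reports the error from the second
.Xr bind 2 ,
which is expected to be
.Er EADDRINUSE .
The helper names
.Fn bind_loopback
and
.Fn second_bind_errno
are hypothetical, and the sketch assumes that no address-reuse socket options
have been set.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Bind s to the loopback address and a port in host byte order. */
static int
bind_loopback(int s, unsigned short port)
{
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);
    return (bind(s, (struct sockaddr *)&sin, sizeof(sin)));
}

/*
 * Bind two sockets to the same port.  Returns the errno value from
 * the second bind, 0 if it unexpectedly succeeds, or -1 if the
 * setup itself fails.
 */
int
second_bind_errno(void)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int a, b, error = -1;

    a = socket(AF_INET, SOCK_STREAM, 0);
    b = socket(AF_INET, SOCK_STREAM, 0);
    if (a != -1 && b != -1 &&
        bind_loopback(a, 0) == 0 &&
        getsockname(a, (struct sockaddr *)&sin, &len) == 0) {
        /* The port now belongs to a; binding b to it should fail. */
        if (bind_loopback(b, ntohs(sin.sin_port)) == -1)
            error = errno;
        else
            error = 0;
    }
    if (a != -1)
        close(a);
    if (b != -1)
        close(b);
    return (error);
}
```
.Ed
.Pp
Programs normally compare
.Va errno
against the symbolic names listed above rather than against numeric values.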
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
600.Em subject to change .
601