.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd February 5, 2012
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
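.Pp
The wildcard binding described above can be sketched as follows.
This is a minimal illustration, not part of the original page:
the helper names
.Fn open_wildcard_listener
and
.Fn bound_port
are hypothetical, and passing a port of 0 is assumed to be the way to
leave the port unspecified so that the system assigns one.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to INADDR_ANY (hypothetical helper).
 * port is in host byte order; 0 asks the system to pick a port.
 * Returns the listening descriptor, or -1 with errno set.
 */
int
open_wildcard_listener(unsigned short port, int backlog)
{
	struct sockaddr_in sin;
	int s, saved;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all local networks */
	sin.sin_port = htons(port);

	/* bind(2) first, then listen(2) turns the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, backlog) == -1) {
		saved = errno;
		close(s);
		errno = saved;
		return (-1);
	}
	return (s);
}

/*
 * Return the local port of a bound socket in host byte order
 * (hypothetical helper), or -1 with errno set.
 */
int
bound_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (-1);
	return (ntohs(sin.sin_port));
}
```

Calling
.Fn open_wildcard_listener
with a port of 0 and then
.Fn bound_port
on the result shows which port the system assigned.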
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This write-only
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This write-only
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable such routers
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
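.Pp
To illustrate the option level and argument types described above,
the following sketch sets
.Dv TCP_NODELAY
and the per-socket keepalive options.
The helper names
.Fn set_nodelay
and
.Fn set_keepalive
are hypothetical, and enabling keepalive probing through
.Dv SO_KEEPALIVE
at the
.Dv SOL_SOCKET
level is an assumption of this example rather than something the option
list requires.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Turn TCP_NODELAY on (non-zero) or off (0) for socket s. */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Enable keepalive probing on socket s and tune it per socket.
 * idle and intvl are in seconds (the MIB variables of the same
 * names are in milliseconds); cnt is a number of probes.
 * Returns 0 on success, -1 with errno set.
 */
int
set_keepalive(int s, unsigned int idle, unsigned int intvl, unsigned int cnt)
{
	int on = 1;

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	return (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)));
}
```

For example, calling
.Fn set_keepalive
with the arguments 60, 10 and 5 asks for the first probe after one idle
minute, a probe every ten seconds after that, and a drop after five
unanswered probes.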
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
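.Pp
The algorithm in use on a socket can be queried with
.Dv TCP_CONGESTION .
The following is a minimal sketch only: the helper name
.Fn get_congestion_algo
is hypothetical, and the sketch assumes that the option value is the
algorithm name as a character string, a detail which is documented in
.Xr mod_cc 4
rather than here.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>
#include <string.h>

/*
 * Copy the name of the congestion control algorithm used by socket s
 * into buf (hypothetical helper; assumes a string-valued option).
 * Returns 0 on success, -1 with errno set.
 */
int
get_congestion_algo(int s, char *buf, size_t buflen)
{
	socklen_t len;

	if (buf == NULL || buflen == 0) {
		errno = EINVAL;
		return (-1);
	}
	memset(buf, 0, buflen);
	/* Leave room for a terminating NUL whatever the kernel copies. */
	len = (socklen_t)(buflen - 1);
	if (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == -1)
		return (-1);
	return (0);
}
```

On a system using the default described above, the name returned for a
new socket would be expected to correspond to
.Xr cc_newreno 4 .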
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
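.Pp
The variables above can be read by name, and the millisecond keepalive
values can be converted to the seconds expected by the per-socket
options.
The following is a minimal sketch under stated assumptions: the helper
names
.Fn read_int_mib
and
.Fn ms_to_sockopt_seconds
are hypothetical, the variable is assumed to be integer-valued, and
rounding up to a whole second is a choice of this example.

```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/*
 * Read an integer-valued MIB variable such as "net.inet.tcp.keepidle"
 * (hypothetical helper).  Returns 0 on success, -1 with errno set.
 * Where sysctlbyname(3) is not available, this sketch fails with ENOSYS.
 */
int
read_int_mib(const char *name, int *value)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*value);

	return (sysctlbyname(name, value, &len, NULL, 0));
#else
	(void)name;
	(void)value;
	errno = ENOSYS;
	return (-1);
#endif
}

/*
 * Convert a millisecond MIB value to whole seconds, rounding up,
 * for use with TCP_KEEPINIT, TCP_KEEPIDLE or TCP_KEEPINTVL.
 */
unsigned int
ms_to_sockopt_seconds(int ms)
{
	if (ms <= 0)
		return (0);
	return (((unsigned int)ms + 999U) / 1000U);
}
```

With the defaults listed above, a
.Va keepidle
of 7200000 msec converts to 7200 seconds and a
.Va keepintvl
of 75000 msec converts to 75 seconds.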
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening on the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
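.Pp
As a minimal sketch of how a caller might act on the errors listed
above, the following pairs an active
.Xr connect 2
attempt with a lookup of the condition each error indicates.
The helper names
.Fn tcp_error_hint
and
.Fn tcp_connect_ipv4
are hypothetical, and the hint strings merely paraphrase the list.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Map an errno value to the condition the ERRORS list gives for it. */
const char *
tcp_error_hint(int error)
{
	switch (error) {
	case EISCONN:
		return ("socket already has a connection");
	case ENOBUFS:
		return ("system out of memory for an internal structure");
	case ETIMEDOUT:
		return ("connection dropped after excessive retransmissions");
	case ECONNRESET:
		return ("remote peer forced the connection closed");
	case ECONNREFUSED:
		return ("remote peer refused the connection");
	case EADDRINUSE:
		return ("port already allocated");
	case EADDRNOTAVAIL:
		return ("no network interface has that address");
	case EAFNOSUPPORT:
		return ("multicast address used with bind or connect");
	default:
		return (strerror(error));
	}
}

/*
 * Attempt an active connection to addr:port, both in host byte order
 * (hypothetical helper).  Returns a connected descriptor, or -1 with
 * errno set to one of the values above or another connect(2) error.
 */
int
tcp_connect_ipv4(in_addr_t addr, unsigned short port)
{
	struct sockaddr_in sin;
	int s, saved;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(addr);
	sin.sin_port = htons(port);

	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		saved = errno;
		close(s);
		errno = saved;
		return (-1);
	}
	return (s);
}
```

When
.Fn tcp_connect_ipv4
returns \-1, passing
.Va errno
to
.Fn tcp_error_hint
yields a message suitable for a log line.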
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .