.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd October 27, 2015
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
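.Pp
As an illustration of wildcard addressing, the following sketch creates a
passive socket bound to
.Dv INADDR_ANY .
Passing port 0 leaves the port unspecified, so the system assigns one,
and the assigned port is read back with
.Xr getsockname 2 .
The function name
.Fn open_wildcard_listener
and the backlog of 16 are assumptions made for this example only.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket on INADDR_ANY.
 * Passing port 0 lets the system choose; the port actually assigned
 * (host byte order) is stored in *port_out.
 * Returns the listening descriptor, or -1 on error.
 */
int
open_wildcard_listener(unsigned short port, unsigned short *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(port);

	/* bind(2) first, then listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 16) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (s);
}
```
.Ed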
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
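.Pp
The following sketch shows how several of the options above might be set
and read back at the
.Dv IPPROTO_TCP
level.
The helper names
.Fn tune_tcp_socket
and
.Fn get_tcp_int_option
are hypothetical, and the timer values a caller passes are its own choice,
not recommendations.
Keepalive probing itself is requested with the socket-level
.Dv SO_KEEPALIVE
option.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: set TCP_NODELAY and the per-socket keepalive
 * timers.  idle_s and intvl_s are in seconds, unlike the millisecond
 * MIB defaults.  Returns 0 on success, -1 on the first failure.
 */
int
tune_tcp_socket(int s, unsigned int idle_s, unsigned int intvl_s,
    unsigned int count)
{
	int on = 1;

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	/* Keepalive probing is requested per socket with SO_KEEPALIVE. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s,
	    sizeof(idle_s)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s,
	    sizeof(intvl_s)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &count,
	    sizeof(count)) == -1)
		return (-1);
	return (0);
}

/* Read back an integer TCP-level option; returns -1 on error. */
int
get_tcp_int_option(int s, int opt)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, opt, &val, &len) == -1)
		return (-1);
	return (val);
}
```
.Ed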
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel
periodically sends a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va initcwnd_segments
Specify the initial congestion window as a number of segments.
The default value is 10 as suggested by RFC 6928.
Changing the value on the fly does not affect connections that use a
congestion window from the hostcache.
Caution:
This regulates the burst of packets allowed to be sent in the first RTT.
The value should be relative to the link capacity.
Start with small values for lower-capacity links.
Large bursts can cause buffer overruns and packet drops if routers have small
buffers or the link is experiencing congestion.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Turn on automatic path MTU blackhole detection.
If retransmits occur, the system lowers the MSS to check whether
the problem is caused by the path MTU.
If the current MSS is greater than the configured value to try,
it is set to the configured value; otherwise,
the MSS is set to the default values
.Po Va net.inet.tcp.mssdflt
and
.Va net.inet.tcp.v6mssdflt
.Pc .
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va pmtud_blackhole_activated
Number of times configured values were used in an attempt to downshift.
.It Va pmtud_blackhole_activated_min_mss
Number of times default MSS was used in an attempt to downshift.
.It Va pmtud_blackhole_failed
Number of connections for which retransmits continued even after MSS
downshift.
.El
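.Pp
As a sketch of how a program might read one of these variables, the
following uses
.Xr sysctlbyname 3
to fetch an integer value such as
.Va net.inet.tcp.keepidle
and converts it from milliseconds to the seconds expected by
.Dv TCP_KEEPIDLE .
The helper names
.Fn read_tcp_mib_int
and
.Fn msec_to_keepalive_sec
are hypothetical, and the example assumes the variable is exported as an
.Vt int .
On systems other than
.Fx
the sketch simply fails with
.Er ENOSYS .
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <stddef.h>
#include <errno.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical helper: fetch an integer-valued MIB variable by name,
 * for example "net.inet.tcp.keepidle".
 * Returns 0 and stores the value in *out, or -1 with errno set.
 */
int
read_tcp_mib_int(const char *name, int *out)
{
#ifdef __FreeBSD__
	size_t len = sizeof(*out);

	return (sysctlbyname(name, out, &len, NULL, 0));
#else
	/* Assumption: no sysctlbyname(3) here, so report ENOSYS. */
	(void)name;
	(void)out;
	errno = ENOSYS;
	return (-1);
#endif
}

/* Convert a millisecond MIB value to whole seconds (0 if not positive). */
unsigned int
msec_to_keepalive_sec(int msec)
{
	return (msec <= 0 ? 0 : (unsigned int)(msec / 1000));
}
```
.Ed
.Pp
With the documented defaults, a
.Va keepidle
of 7200000 msec corresponds to a
.Dv TCP_KEEPIDLE
value of 7200 seconds.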
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
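.Pp
A caller might turn these error values into short diagnostics along the
following lines.
This is only a sketch: the function name
.Fn tcp_errno_summary
and the message strings are invented for the example and paraphrase the
list above.
.Bd -literal -offset indent
```c
#include <errno.h>

/*
 * Hypothetical helper: map an errno value from a failed TCP socket
 * operation to a short summary of the conditions listed above.
 * Values not in the list yield "other error".
 */
const char *
tcp_errno_summary(int err)
{
	switch (err) {
	case EISCONN:
		return ("socket already has a connection");
	case ENOBUFS:
		return ("out of memory for an internal data structure");
	case ETIMEDOUT:
		return ("dropped after excessive retransmissions");
	case ECONNRESET:
		return ("remote peer forced the connection closed");
	case ECONNREFUSED:
		return ("remote peer refused connection establishment");
	case EADDRINUSE:
		return ("port already allocated");
	case EADDRNOTAVAIL:
		return ("no network interface for that address");
	case EAFNOSUPPORT:
		return ("bind or connect to a multicast address");
	default:
		return ("other error");
	}
}
```
.Ed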
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
631.Em subject to change .
632