.\" xref: /freebsd/share/man/man4/tcp.4 (revision f4bf2442a03f9b72cfe6d051766b650a4721f3d8)
.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd May 19, 2016
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
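.Pp
As an illustration of the passive-socket and wildcard-addressing steps
described above, the following is a minimal sketch, not a definitive
implementation.
The function names
.Fn open_wildcard_listener
and
.Fn bound_port ,
and the backlog of 5, are hypothetical choices made for this example.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * Passing port 0 lets the system assign a port.
 * Returns the listening descriptor, or -1 on error.
 */
int
open_wildcard_listener(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);	/* sockets start out active */
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* listen on all networks */
	sin.sin_port = htons(port);

	/* bind(2) then listen(2) turns the socket into a passive one. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/* Return the locally bound port in host byte order, or 0 on error. */
unsigned short
bound_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (0);
	return (ntohs(sin.sin_port));
}
```
.Ed
.Pp
A caller would pass the returned descriptor to
.Xr accept 2 ;
.Fn bound_port
is only useful when port 0 was requested.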
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CCALGOOPT
Set or query congestion control algorithm specific parameters.
See
.Xr mod_cc 4
for details.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a
.Fx
router deployment is to enable such routers
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
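.Pp
A minimal sketch of setting options at the
.Dv IPPROTO_TCP
level follows.
The helper names
.Fn set_nodelay
and
.Fn set_keepalive
are hypothetical, and the sketch assumes that keepalive probes are enabled
per socket with the socket-level
.Dv SO_KEEPALIVE
option.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Defeat the small-packet gathering described under TCP_NODELAY. */
int
set_nodelay(int s)
{
	int on = 1;

	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Enable keepalive probes and tune them for this socket only.
 * idle and intvl are in seconds; cnt is a number of probes.
 * Returns 0 on success, -1 on the first failing call.
 */
int
set_keepalive(int s, unsigned int idle, unsigned int intvl, unsigned int cnt)
{
	int on = 1;

	/* Assumption: probes are switched on at the socket level. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1)
		return (-1);
	return (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)));
}
```
.Ed
.Pp
Calling
.Fn set_keepalive
on a listening socket relies on the inheritance across
.Xr accept 2
described for the individual options.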
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
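.Pp
To see which algorithm a given socket is using, the
.Dv TCP_CONGESTION
option can be queried.
The following is a sketch under the assumption, taken from
.Xr mod_cc 4
rather than from this page, that the option value is the algorithm name
as a character string; the helper name
.Fn get_cc_name
is hypothetical.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

/*
 * Copy the name of the congestion control algorithm used by socket s
 * into buf as a NUL-terminated string.
 * Returns 0 on success, -1 on error or if buf is too small to be useful.
 */
int
get_cc_name(int s, char *buf, size_t buflen)
{
	socklen_t len;

	if (buflen < 2)
		return (-1);

	/* Zero the buffer and keep one byte back for the terminator. */
	memset(buf, 0, buflen);
	len = (socklen_t)(buflen - 1);
	return (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len));
}
```
.Ed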
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will then
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va initcwnd_segments
Specify the initial congestion window as a number of segments.
The default value is 10, as suggested by RFC 6928.
Changing the value on the fly does not affect connections that use a
congestion window from the hostcache.
Caution:
This regulates the burst of packets allowed to be sent in the first RTT.
The value should be relative to the link capacity.
Start with small values for lower-capacity links.
Large bursts can cause buffer overruns and packet drops if routers have small
buffers or the link is experiencing congestion.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
Settings:
.Bl -tag -compact
.It 0
Disable ECN.
.It 1
Allow incoming connections to request ECN.
Outgoing connections will request ECN.
.It 2
Allow incoming connections to request ECN.
Outgoing connections will not request ECN.
.El
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.It Va pmtud_blackhole_detection
Turn on automatic path MTU blackhole detection.
On retransmits, the OS will
lower the MSS to check whether the problem is MTU-related.
If the current MSS is greater than the
configured value to try, it will be set to the configured value; otherwise,
the MSS will be set to the default values
.Po Va net.inet.tcp.mssdflt
and
.Va net.inet.tcp.v6mssdflt
.Pc .
.It Va pmtud_blackhole_mss
MSS to try for IPv4 if PMTU blackhole detection is turned on.
.It Va v6pmtud_blackhole_mss
MSS to try for IPv6 if PMTU blackhole detection is turned on.
.It Va pmtud_blackhole_activated
Number of times configured values were used in an attempt to downshift.
.It Va pmtud_blackhole_activated_min_mss
Number of times default MSS was used in an attempt to downshift.
.It Va pmtud_blackhole_failed
Number of connections for which retransmits continued even after MSS
downshift.
.El
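.Pp
The variables above can be read from a program by name.
The following sketch assumes that the variable being read is integer-valued
and that
.Xr sysctlbyname 3
is available; the preprocessor guard exists only so that the sketch still
compiles on systems without it.
The helper names
.Fn read_tcp_mib_int
and
.Fn msec_to_sec
are hypothetical.
The second helper illustrates the unit difference noted earlier: the keepalive
MIB variables are in milliseconds while the per-socket options take seconds.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <stddef.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/*
 * Read an integer variable from the net.inet.tcp branch by name,
 * for example "net.inet.tcp.keepidle".
 * Returns 0 and stores the value in *out, or -1 on error.
 */
int
read_tcp_mib_int(const char *name, int *out)
{
#ifdef __FreeBSD__
	size_t len = sizeof(*out);

	return (sysctlbyname(name, out, &len, NULL, 0));
#else
	/* Assumption: no sysctlbyname(3) here, so report failure. */
	(void)name;
	(void)out;
	return (-1);
#endif
}

/*
 * Convert a millisecond MIB value to the whole seconds expected by
 * the per-socket keepalive options; non-positive input yields 0.
 */
unsigned int
msec_to_sec(int msec)
{
	return (msec > 0 ? (unsigned int)msec / 1000 : 0);
}
```
.Ed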
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
647.Em subject to change .
648