.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes under sponsorship from the FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd November 8, 2013
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/tcp.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
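.Pp
The sequence above can be sketched as follows.
This is a minimal illustration only: the helper name
.Fn make_passive_socket
is hypothetical, and the choice of the loopback address is an assumption
made to keep the example local.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the loopback address.
 * make_passive_socket() is a hypothetical helper, not a system call.
 * Returns the descriptor, or -1 on failure.
 */
int
make_passive_socket(unsigned short port, int backlog)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);	/* sockets start out active */
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* bind(2) comes first; listen(2) then makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, backlog) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```
.Ed
.Pp
A caller would pass the returned descriptor to
.Xr accept 2 ;
a port of 0 leaves the port choice to the system.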
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
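.Pp
As a sketch of wildcard addressing, the fragment below binds
.Dv INADDR_ANY
with an unspecified port and then asks the system which port it assigned.
The helper name
.Fn bind_wildcard
is hypothetical and is used here only for illustration.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/*
 * Bind s to the wildcard address and let the system pick the port.
 * bind_wildcard() is a hypothetical helper.
 * On success, stores the assigned port (host byte order) in *portp
 * and returns 0; returns -1 on failure.
 */
int
bind_wildcard(int s, unsigned short *portp)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(0);			/* port not specified */

	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		return (-1);
	/* Read back the address to learn the port the system assigned. */
	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (-1);
	*portp = ntohs(sin.sin_port);
	return (0);
}
```
.Ed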
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_CONGESTION"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_CONGESTION
Select or query the congestion control algorithm that TCP will use for the
connection.
See
.Xr mod_cc 4
for details.
.It Dv TCP_KEEPINIT
This
.Xr setsockopt 2
option accepts a per-socket timeout argument of
.Vt "u_int"
in seconds, for new, non-established
.Tn TCP
connections.
For the global default in milliseconds see
.Va keepinit
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPIDLE
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
for the amount of time, in seconds, that the connection must be idle
before keepalive probes (if enabled) are sent for the connection of this
socket.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepidle
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPINTVL
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
to set the per-socket interval, in seconds, between keepalive probes sent
to a peer.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default in milliseconds see
.Va keepintvl
in the
.Sx MIB Variables
section further down.
.It Dv TCP_KEEPCNT
This
.Xr setsockopt 2
option accepts an argument of
.Vt "u_int"
and allows a per-socket tuning of the number of probes sent, with no response,
before the connection will be dropped.
If set on a listening socket, the value is inherited by the newly created
socket upon
.Xr accept 2 .
For the global default see
.Va keepcnt
in the
.Sx MIB Variables
section further down.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
Outgoing traffic is digested;
digests on incoming traffic are verified if the
.Va net.inet.tcp.signature_verify_input
sysctl is nonzero.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
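.Pp
As an illustration of the per-socket keepalive options described above,
the following sketch enables keepalives and sets the three timers.
The helper name
.Fn set_keepalive
is hypothetical; enabling probes through
.Dv SO_KEEPALIVE
is an assumption about how the caller wants keepalives switched on.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Enable keepalive probes on s and tune them per socket.
 * set_keepalive() is a hypothetical helper.
 * idle and intvl are in seconds; cnt is a number of probes.
 * Returns 0 on success, -1 on the first failing setsockopt(2).
 */
int
set_keepalive(int s, unsigned int idle, unsigned int intvl, unsigned int cnt)
{
	int on = 1;

	/* Probes are only sent when keepalives are enabled. */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
	    sizeof(cnt)) == -1)
		return (-1);
	return (0);
}
```
.Ed
.Pp
For example,
.Fn set_keepalive s 60 10 5
would request a first probe after 60 idle seconds, then up to 5 probes
10 seconds apart.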
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
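.Pp
A minimal sketch of obtaining the option level and setting a boolean option
follows.
The helper names
.Fn tcp_option_level
and
.Fn set_nodelay
are hypothetical; falling back to
.Dv IPPROTO_TCP
when the protocol database lookup fails is a choice made for this example.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netdb.h>
#include <stddef.h>

/*
 * Return the option level for TCP socket options.
 * tcp_option_level() is a hypothetical helper; it falls back to
 * IPPROTO_TCP if the protocol database has no "tcp" entry.
 */
int
tcp_option_level(void)
{
	struct protoent *pe;

	pe = getprotobyname("tcp");
	return (pe != NULL ? pe->p_proto : IPPROTO_TCP);
}

/*
 * Turn TCP_NODELAY on (nonzero) or off (zero) for socket s.
 * set_nodelay() is a hypothetical helper.
 * Returns the result of setsockopt(2).
 */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, tcp_option_level(), TCP_NODELAY,
	    &on, sizeof(on)));
}
```
.Ed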
.Pp
Options at the
.Tn IP
network level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Pp
The default congestion control algorithm for
.Tn TCP
is
.Xr cc_newreno 4 .
Other congestion control algorithms can be made available using the
.Xr mod_cc 4
framework.
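.Pp
The algorithm in use on a socket can be queried with the
.Dv TCP_CONGESTION
option.
The sketch below assumes that the option value is returned as a
character string naming the algorithm; the helper name
.Fn get_cc_name
and the buffer handling are illustrative only.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

/*
 * Copy the name of the congestion control algorithm used by s
 * into buf.  get_cc_name() is a hypothetical helper.
 * Assumes the option value is a character string.
 * Returns 0 on success, -1 on failure.
 */
int
get_cc_name(int s, char *buf, size_t buflen)
{
	socklen_t len;

	if (buflen < 2)
		return (-1);
	memset(buf, 0, buflen);
	/* Reserve the last byte so the result is always terminated. */
	len = (socklen_t)(buflen - 1);
	return (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, buf, &len));
}
```
.Ed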
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
The default is 75000 msec.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
The default is 7200000 msec (2 hours).
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
The default is 75000 msec.
.It Va keepcnt
Number of probes sent, with no response, before a connection
is dropped.
The default is 8 packets.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
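.Pp
As a rough worked example of how the keepalive variables combine,
the sketch below estimates how long an idle connection to an unresponsive
peer survives.
It assumes the connection is dropped once
.Va keepidle
plus
.Va keepcnt
intervals of
.Va keepintvl
have passed without a response; the helper name
.Fn keepalive_drop_ms
is hypothetical.
.Bd -literal -offset indent
```c
/*
 * Estimate, in milliseconds, how long an idle connection to a dead
 * peer lasts before it is dropped.  keepalive_drop_ms() is a
 * hypothetical helper; keepidle and keepintvl are in milliseconds,
 * as in the net.inet.tcp MIB variables, and keepcnt is a probe count.
 * Assumes drop time = keepidle + keepcnt * keepintvl.
 */
unsigned long
keepalive_drop_ms(unsigned long keepidle, unsigned long keepintvl,
    unsigned long keepcnt)
{
	return (keepidle + keepcnt * keepintvl);
}
```
.Ed
.Pp
With the defaults listed above (7200000, 75000 and 8) this gives
7800000 msec, or 2 hours and 10 minutes.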
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
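.Pp
One of these errors can be provoked deliberately, which may help when
testing error handling.
In the sketch below, binding a second socket to a port that is already
allocated is expected to fail with
.Er EADDRINUSE .
The helper names
.Fn try_bind
and
.Fn bound_port
are hypothetical, and the loopback address is an assumption made to keep
the example local.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>

/*
 * Bind s to the given port on the loopback address.
 * try_bind() is a hypothetical helper.
 * Returns 0 on success, or the errno value set by bind(2).
 */
int
try_bind(int s, unsigned short port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		return (errno);
	return (0);
}

/*
 * Return the local port of s in host byte order, or 0 on failure.
 * bound_port() is a hypothetical helper.
 */
unsigned short
bound_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (0);
	return (ntohs(sin.sin_port));
}
```
.Ed
.Pp
Binding one socket with
.Fn try_bind a 0 ,
reading its port with
.Fn bound_port a ,
and passing that port to
.Fn try_bind
on a second socket should yield
.Er EADDRINUSE .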
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr mod_cc 4 ,
.Xr siftr 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
602