.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd September 16, 2010
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
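.Pp
As a minimal sketch of the passive-socket setup described above, the
following C fragment binds a socket to
.Dv INADDR_ANY ,
lets the system assign the port when 0 is requested, and marks the socket
passive with
.Xr listen 2 .
The helper name
.Fn open_wildcard_listener
and the backlog of 5 are illustrative assumptions, not part of the
.Tn TCP
interface.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive TCP socket bound to the
 * wildcard address.  A port of 0 asks the system to assign one.
 * On success the port in use is stored in *assigned (if not NULL)
 * and the listening descriptor is returned; on failure -1 is returned.
 */
int
open_wildcard_listener(in_port_t port, in_port_t *assigned)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all local networks */
	sin.sin_port = htons(port);

	/* bind(2) followed by listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	if (assigned != NULL)
		*assigned = ntohs(sin.sin_port);
	return (s);
}
```
.Ed
.Pp
A caller would then pass the returned descriptor to
.Xr accept 2
to receive incoming connections.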
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option in a
.Fx
router deployment is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
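.Pp
To illustrate the option level and header described above, the following
C sketch sets the boolean
.Dv TCP_NODELAY
option with
.Xr setsockopt 2
and reads it back with
.Xr getsockopt 2 .
The helper name
.Fn tcp_set_nodelay
is hypothetical and only serves the example.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: set the boolean TCP_NODELAY option on socket s
 * and read it back.  Returns 1 if the option is now on, 0 if it is
 * off, or -1 if either system call failed.
 */
int
tcp_set_nodelay(int s, int on)
{
	socklen_t len = sizeof(on);

	/* TCP options use the IPPROTO_TCP level, not SOL_SOCKET. */
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, &len) == -1)
		return (-1);
	return (on != 0);
}
```
.Ed
.Pp
The same calling pattern applies to the other boolean options listed above,
with the option name substituted.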
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
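.Pp
The keepalive variables above combine into a rough bound on how long a
silent peer is tolerated, and the description of
.Va rexmit_min
and
.Va rexmit_slop
likewise reduces to a one-line formula.
The following C sketch models both from the descriptions on this page only;
it is a simplification and not the kernel's code.
The function names are hypothetical, and the sample values of 7200000 ms for
.Va keepidle
and 75000 ms for
.Va keepintvl
mentioned in the comment are assumed for illustration.
.Bd -literal
```c
/*
 * Hypothetical helper: approximate time, in milliseconds, from the
 * last segment received until an unresponsive connection is dropped.
 * Probes start after keepidle_ms of silence and repeat every
 * keepintvl_ms until keepcnt probes have gone unanswered.
 * With the assumed sample values 7200000, 75000 and 8 probes,
 * the result is 7800000 ms.
 */
long
keepalive_drop_ms(long keepidle_ms, long keepintvl_ms, int keepcnt)
{
	return (keepidle_ms + (long)keepcnt * keepintvl_ms);
}

/*
 * Hypothetical helper: simplified retransmit timeout.  The slop is
 * added to the raw estimate and the sum is clamped to the minimum.
 */
long
rexmit_timeout_ms(long raw_ms, long slop_ms, long min_ms)
{
	long rto = raw_ms + slop_ms;

	return (rto < min_ms ? min_ms : rto);
}
```
.Ed
.Pp
With 200 ms of slop and a minimum near zero,
.Fn rexmit_timeout_ms
never returns less than 200 for a non-negative raw estimate,
which matches the effective minimum described above.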
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
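.Pp
As one possible way to surface the errors above to a user, the following
C sketch maps each listed
.Va errno
value to a short description taken from this section.
The function name
.Fn tcp_error_hint
and the wording of the strings are hypothetical; portable programs would
more commonly call
.Xr strerror 3 .
.Bd -literal
```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a short description for the errors
 * listed in this section, or NULL for any other value.
 */
const char *
tcp_error_hint(int error)
{
	switch (error) {
	case EISCONN:
		return ("socket is already connected");
	case ENOBUFS:
		return ("out of memory for an internal data structure");
	case ETIMEDOUT:
		return ("connection dropped after excessive retransmissions");
	case ECONNRESET:
		return ("remote peer forced the connection closed");
	case ECONNREFUSED:
		return ("remote peer refused the connection");
	case EADDRINUSE:
		return ("port has already been allocated");
	case EADDRNOTAVAIL:
		return ("no network interface has that address");
	case EAFNOSUPPORT:
		return ("address is a multicast address");
	default:
		return (NULL);
	}
}
```
.Ed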
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .