.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd January 8, 2011
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
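.Pp
The following is a minimal sketch, not a definitive implementation, of
creating a passive socket with the wildcard addressing described above.
The helper names
.Fn open_wildcard_listener
and
.Fn local_port
are hypothetical and chosen only for illustration; a port of 0 is assumed
to request a system-assigned port.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to INADDR_ANY.  A port of 0 asks
 * the system to assign one.  Returns the descriptor, or -1 on failure.
 * (open_wildcard_listener is a hypothetical helper name.)
 */
int
open_wildcard_listener(in_port_t port, int backlog)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(port);
	/* bind(2) first, then listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, backlog) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/* Return the local port of a bound socket in host byte order, or -1. */
int
local_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (-1);
	return (ntohs(sin.sin_port));
}
```
.Ed
.Pp
The returned descriptor can then be passed to
.Xr accept 2 ,
which should yield connections arriving on any local address.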
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx Ns -specific
additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a router deployment is to enable
.Fx Ns -based
routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
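.Pp
As an illustrative sketch of the calling convention just described, the
fragment below sets and reads back the boolean
.Dv TCP_NODELAY
option at the
.Dv IPPROTO_TCP
level.
The function names
.Fn set_nodelay
and
.Fn get_nodelay
are hypothetical; the other options listed above are assumed to follow the
same pattern, with an argument type appropriate to each option.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Set or clear the boolean TCP_NODELAY option on socket s.
 * Returns 0 on success, or -1 with errno set.
 * (set_nodelay is a hypothetical helper name.)
 */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Read TCP_NODELAY back.
 * Returns 1 if set, 0 if clear, or -1 on error.
 * (get_nodelay is a hypothetical helper name.)
 */
int
get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```
.Ed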
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines, when no response is received on a
.Va keepidle
probe.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify that the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.maxholes
Maximum number of SACK holes per connection.
Defaults to 128.
.It Va sack.globalmaxholes
Maximum number of SACK holes per system, across all connections.
Defaults to 65536.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible size and use, and a new structure is allocated to contain a
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
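.Pp
The variables listed above can be read from a program by name.
The sketch below uses
.Fn sysctlbyname ,
documented in
.Xr sysctl 3 ,
and assumes that the variable being read is an integer, as
.Va net.inet.tcp.mssdflt
is.
The helper name
.Fn read_tcp_mib_int
is hypothetical, and the fallback branch for systems other than
.Fx
exists only so that the fragment compiles elsewhere.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/*
 * Read an integer-valued MIB variable such as "net.inet.tcp.mssdflt".
 * Returns 0 and stores the value in *out, or -1 with errno set.
 * (read_tcp_mib_int is a hypothetical helper name.)
 */
int
read_tcp_mib_int(const char *name, int *out)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*out);

	/* No new value is supplied, so this is a read-only query. */
	return (sysctlbyname(name, out, &len, NULL, 0));
#else
	/* Assumption: sysctlbyname(3) is not available on this system. */
	(void)name;
	(void)out;
	errno = ENOSYS;
	return (-1);
#endif
}
```
.Ed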
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening on the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
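.Pp
As a hedged illustration of how these errors reach a program, the sketch
below attempts an active connection to the loopback address and returns the
.Va errno
value set by
.Xr connect 2 .
The helper name
.Fn try_connect_loopback
is hypothetical.
When no socket is listening on the chosen port, the result is expected to be
.Er ECONNREFUSED ,
assuming the
.Va blackhole
variable described above is not enabled.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to connect to 127.0.0.1 on the given port (host byte order).
 * Returns 0 on success, otherwise the errno value reported by
 * socket(2) or connect(2), for example ECONNREFUSED.
 * (try_connect_loopback is a hypothetical helper name.)
 */
int
try_connect_loopback(in_port_t port)
{
	struct sockaddr_in sin;
	int error = 0, s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (errno);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);
	/* Save errno before close(2) has a chance to overwrite it. */
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		error = errno;
	close(s);
	return (error);
}
```
.Ed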
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .