.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd August 16, 2008
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
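.Pp
The following sketch illustrates the active and passive roles described above.
It is an illustration only, not part of the original manual text;
the function name
.Fn loopback_handshake
is hypothetical, and the use of the loopback address with a
system-assigned port is an assumption made to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a passive socket on the loopback address,
 * connect an active socket to it, and accept the connection.
 * Returns 0 on success and -1 on failure.
 */
int
loopback_handshake(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int lfd, cfd, afd = -1, rc = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);	/* will become passive */
	cfd = socket(AF_INET, SOCK_STREAM, 0);	/* stays active */
	if (lfd == -1 || cfd == -1)
		goto out;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(0);		/* let the system pick a port */

	/* bind(2) followed by listen(2) makes the socket passive. */
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(lfd, 1) == -1 ||
	    getsockname(lfd, (struct sockaddr *)&sin, &len) == -1)
		goto out;

	/* The active socket initiates; the passive socket accepts. */
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto out;
	afd = accept(lfd, NULL, NULL);
	if (afd != -1)
		rc = 0;
out:
	if (afd != -1)
		close(afd);
	if (cfd != -1)
		close(cfd);
	if (lfd != -1)
		close(lfd);
	return (rc);
}
```

A caller would simply check that
.Fn loopback_handshake
returns 0; a real server would keep the accepted descriptor open and
exchange data on it.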
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
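.Pp
Wildcard addressing as described above might look like the following sketch.
The helper name
.Fn bind_wildcard
and the backlog value are hypothetical choices for illustration;
the port is left unspecified so that the system assigns one, which is then
read back with
.Xr getsockname 2 .

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a passive socket to INADDR_ANY with an
 * unspecified port and report the port chosen by the system.
 * Returns the listening descriptor, or -1 on failure.
 */
int
bind_wildcard(in_port_t *portp)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(0);			/* system assigns */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(fd, 5) == -1 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) == -1) {
		close(fd);
		return (-1);
	}
	*portp = ntohs(sin.sin_port);	/* port in host byte order */
	return (fd);
}
```

A server that needs a well-known port would instead pass that port,
in network byte order, in
.Va sin_port .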
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option in a
.Fx
router deployment is to allow the routers
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
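.Pp
As a sketch of how the read-only
.Dv TCP_INFO
option listed above might be queried, consider the following.
The helper name
.Fn query_tcp_info
is hypothetical, and the
.Dv _DEFAULT_SOURCE
definition is an assumption that only matters when building against glibc.
Because the set of fields that are filled out is subject to change,
the sketch only retrieves the structure and does not interpret it.

```c
#define _DEFAULT_SOURCE	/* assumption: exposes struct tcp_info on glibc */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

/*
 * Hypothetical helper: fetch the TCP_INFO structure for a TCP socket.
 * Returns 0 on success and -1 on failure (errno is set by getsockopt).
 */
int
query_tcp_info(int fd, struct tcp_info *ti)
{
	socklen_t len = sizeof(*ti);

	/* Zero the structure so fields the kernel leaves alone read as 0. */
	memset(ti, 0, sizeof(*ti));
	return (getsockopt(fd, IPPROTO_TCP, TCP_INFO, ti, &len));
}
```

Callers should consult the definition of
.Vt "struct tcp_info"
in
.In netinet/tcp.h
before relying on any particular field.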
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
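.Pp
For example, a boolean option such as
.Dv TCP_NODELAY
could be set at the
.Dv IPPROTO_TCP
level and then tested, roughly as follows.
The helper name
.Fn set_nodelay
is hypothetical.
The value read back is only compared against zero, on the assumption that
an implementation may report any non-zero value for an enabled option.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable TCP_NODELAY on a TCP socket and read the
 * option back.
 * Returns 1 if the option reads back as enabled, 0 if it does not,
 * and -1 if either system call fails.
 */
int
set_nodelay(int fd)
{
	int on = 1, val = 0;
	socklen_t len = sizeof(val);

	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	/* Treat any non-zero value as "enabled". */
	return (val != 0);
}
```

The same pattern applies to the other boolean options, with the option
name substituted.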
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
411Enable debugging for the bandwidth-delay product algorithm.
412.It Va inflight.min
413This puts a lower bound on the bandwidth-delay product window, in bytes.
414A value of 1024 is typically used for debugging.
4156000-16000 is more typical in a production installation.
416Setting this value too low may result in
417slow ramp-up times for bursty connections.
418Setting this value too high effectively disables the algorithm.
419.It Va inflight.max
420This puts an upper bound on the bandwidth-delay product window, in bytes.
421This value should not generally be modified, but may be used to set a
422global per-connection limit on queued data, potentially allowing you to
423intentionally set a less than optimum limit, to smooth data flow over a
424network while still being able to specify huge internal
425.Tn TCP
426buffers.
427.It Va inflight.stab
428The bandwidth-delay product algorithm requires a slightly larger window
429than it otherwise calculates for stability.
430This parameter determines the extra window in maximal packets / 10.
431The default value of 20 represents 2 maximal packets.
432Reducing this value is not recommended, but you may
433come across a situation with very slow links where the
434.Xr ping 8
435time
436reduction of the default inflight code is not sufficient.
437If this case occurs, you should first try reducing
438.Va inflight.min
439and, if that does not
440work, reduce both
441.Va inflight.min
442and
443.Va inflight.stab ,
444trying values of
44515, 10, or 5 for the latter.
446Never use a value less than 5.
447Reducing
448.Va inflight.stab
449can lead to upwards of a 20% underutilization of the link
450as well as reducing the algorithm's ability to adapt to changing
451situations and should only be done as a last resort.
452.It Va rfc3042
453Enable the Limited Transmit algorithm as described in RFC 3042.
454It helps avoid timeouts on lossy links and also when the congestion window
455is small, as happens on short transfers.
456.It Va rfc3390
457Enable support for RFC 3390, which allows for a variable-sized
458starting congestion window on new connections, depending on the
459maximum segment size.
460This helps throughput in general, but
461particularly affects short transfers and high-bandwidth large
462propagation-delay connections.
463.Pp
464When this feature is enabled, the
465.Va slowstart_flightsize
466and
467.Va local_slowstart_flightsize
468settings are not observed for new
469connection slow starts, but they are still used for slow starts
470that occur when the connection has been idle and starts sending
471again.
472.It Va sack.enable
473Enable support for RFC 2018, TCP Selective Acknowledgment option,
474which allows the receiver to inform the sender about all successfully
475arrived segments, allowing the sender to retransmit the missing segments
476only.
477.It Va sack.maxholes
478Maximum number of SACK holes per connection.
479Defaults to 128.
480.It Va sack.globalmaxholes
481Maximum number of SACK holes per system, across all connections.
482Defaults to 65536.
483.It Va maxtcptw
484When a TCP connection enters the
485.Dv TIME_WAIT
486state, its associated socket structure is freed, since it is of
487negligible size and use, and a new structure is allocated to contain a
488minimal amount of information necessary for sustaining a connection in
489this state, called the compressed TCP TIME_WAIT state.
490Since this structure is smaller than a socket structure, it can save
491a significant amount of system memory.
492The
493.Va net.inet.tcp.maxtcptw
494MIB variable controls the maximum number of these structures allocated.
495By default, it is initialized to
496.Va kern.ipc.maxsockets
497/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.It Va fast_finwait2_recycle
Recycle
.Tn TCP
.Dv FIN_WAIT_2
connections faster when the socket is marked as
.Dv SBS_CANTRCVMORE
(no user process has the socket open, data received on
the socket cannot be read).
The timeout used here is
.Va finwait2_timeout .
.It Va finwait2_timeout
Timeout to use for fast recycling of
.Tn TCP
.Dv FIN_WAIT_2
connections.
Defaults to 60 seconds.
.It Va ecn.enable
Enable support for TCP Explicit Congestion Notification (ECN).
ECN allows a TCP sender to reduce the transmission rate in order to
avoid packet drops.
.It Va ecn.maxretries
Number of retries (SYN or SYN/ACK retransmits) before disabling ECN on a
specific connection.
This is needed to help with connection establishment
when a broken firewall is in the network path.
.El
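.Pp
The variables above can be read programmatically with
.Xr sysctlbyname 3 .
The following is a minimal sketch; the helper name
.Fn read_tcp_mib_int
is hypothetical, and it assumes that the variable being read is stored as an
.Vt int ,
which should be checked for each variable before use.
On systems other than
.Fx
the helper simply reports failure.

```c
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif
#include <stddef.h>

/*
 * Hypothetical helper: read an int-sized MIB variable by name, e.g.
 * "net.inet.tcp.rfc1323".
 * Returns 0 on success and -1 on failure, including when the variable
 * is not int-sized or sysctlbyname(3) is unavailable.
 */
int
read_tcp_mib_int(const char *name, int *valp)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*valp);

	/* Passing NULL and 0 for the new value makes this a read. */
	if (sysctlbyname(name, valp, &len, NULL, 0) == -1 ||
	    len != sizeof(*valp))
		return (-1);
	return (0);
#else
	(void)name;
	(void)valp;
	return (-1);	/* assumption: no sysctlbyname(3) here */
#endif
}
```

From the command line, the same values are normally inspected with
.Xr sysctl 8 .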
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
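.Pp
As an illustration of how one of these errors surfaces to a program,
the sketch below provokes
.Er EISCONN
by calling
.Xr connect 2
twice on the same socket.
The helper name
.Fn second_connect_errno
is hypothetical, and the loopback listener exists only to make the
example self-contained.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to a loopback listener, then call
 * connect(2) again on the already connected socket.
 * Returns the errno from the second call, 0 if that call unexpectedly
 * succeeded, or -1 if the setup failed.
 */
int
second_connect_errno(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int lfd, cfd, rc = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd == -1 || cfd == -1)
		goto out;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(0);	/* system-assigned port */

	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(lfd, 1) == -1 ||
	    getsockname(lfd, (struct sockaddr *)&sin, &len) == -1)
		goto out;

	/* First connect establishes the connection. */
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto out;

	/* Second connect on the same socket is expected to fail. */
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		rc = errno;
	else
		rc = 0;
out:
	if (cfd != -1)
		close(cfd);
	if (lfd != -1)
		close(lfd);
	return (rc);
}
```

Programs typically compare
.Va errno
against the values listed above, or pass it to
.Xr strerror 3
for reporting.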
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Rs
.%A "K. Ramakrishnan"
.%A "S. Floyd"
.%A "D. Black"
.%T "The Addition of Explicit Congestion Notification (ECN) to IP"
.%O "RFC 3168"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .