.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD$
.\"
.Dd January 22, 2007
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq "port addresses" .
Thus, each address is composed
of an Internet address specifying the host and network,
with a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This technique, termed
.Dq "wildcard addressing" ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally, this address corresponds to the peer entity's network.
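.Pp
As a minimal sketch of wildcard addressing, the hypothetical helper
.Fn listen_any
below (not a system interface) binds
.Dv INADDR_ANY ,
marks the socket passive with
.Xr listen 2 ,
and reads back the port in use with
.Xr getsockname 2 .
It assumes that a port of 0 is how a caller leaves the port unspecified;
the backlog of 5 is an arbitrary value chosen for illustration.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * On entry *port is the requested port (0 lets the system choose);
 * on success *port holds the port actually bound, in host byte order.
 * Returns the listening descriptor, or -1 on error.
 * listen_any() is a hypothetical helper used only for illustration.
 */
int
listen_any(unsigned short *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(*port);			/* 0: system assigns */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}
```
.Ed
A caller would typically follow this with
.Xr accept 2
on the returned descriptor.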
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv TCP_NODELAY"
.It Dv TCP_INFO
Information about a socket's underlying TCP session may be retrieved
by passing the read-only option
.Dv TCP_INFO
to
.Xr getsockopt 2 .
It accepts a single argument: a pointer to an instance of
.Vt "struct tcp_info" .
.Pp
This API is subject to change; consult the source to determine
which fields are currently filled out by this option.
.Fx
specific additions include
send window size,
receive window size,
and
bandwidth-controlled window space.
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender- and
.No receiver- Ns Tn TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the
.No sender- Ns Tn TCP
will set the
.Dq push
bit, and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When this option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_MD5SIG
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this option is to enable routers based on
.Fx
to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4
.Pq Dv AF_INET
sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcp_signature_compute: SADB lookup failed for %d.%d.%d.%d" .
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
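.Pp
The following minimal sketch shows one way a boolean option such as
.Dv TCP_NODELAY
might be set at the
.Dv IPPROTO_TCP
level and then tested.
The helper name
.Fn tcp_set_bool
is hypothetical and not part of any system interface.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Set the boolean TCP-level option opt on socket s, then read it back.
 * Returns 1 if the option reads back enabled, 0 if disabled,
 * or -1 if either call fails (errno is left set by the failing call).
 * tcp_set_bool() is a hypothetical helper used only for illustration.
 */
int
tcp_set_bool(int s, int opt, int on)
{
	int val = on ? 1 : 0;
	socklen_t len = sizeof(val);

	if (setsockopt(s, IPPROTO_TCP, opt, &val, sizeof(val)) == -1)
		return (-1);
	if (getsockopt(s, IPPROTO_TCP, opt, &val, &len) == -1)
		return (-1);
	return (val != 0);
}
```
.Ed
For example,
.Fn tcp_set_bool s TCP_NODELAY 1
would disable the gathering of small writes on socket
.Fa s .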
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Ss MIB Variables
The
.Tn TCP
protocol implements a number of variables in the
.Va net.inet.tcp
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width ".Va TCPCTL_DO_RFC1323"
.It Dv TCPCTL_DO_RFC1323
.Pq Va rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default is true).
.It Dv TCPCTL_MSSDFLT
.Pq Va mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq Va sendspace
Maximum
.Tn TCP
send window.
.It Dv TCPCTL_RECVSPACE
.Pq Va recvspace
Maximum
.Tn TCP
receive window.
.It Va log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to
.Tn SYN
(connection establishment) packets only.
That of 2 results in any
.Tn TCP
packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It Va slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It Va local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It Va msl
The Maximum Segment Lifetime, in milliseconds, for a packet.
.It Va keepinit
Timeout, in milliseconds, for new, non-established
.Tn TCP
connections.
.It Va keepidle
Amount of time, in milliseconds, that the connection must be idle
before keepalive probes (if enabled) are sent.
.It Va keepintvl
The interval, in milliseconds, between keepalive probes sent to remote
machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It Va always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections, so that the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It Va icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It Va do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It Va blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It Va delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It Va delacktime
Maximum amount of time, in milliseconds, before a delayed ACK is sent.
.It Va newreno
Enable
.Tn TCP
NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It Va path_mtu_discovery
Enable Path MTU Discovery.
.It Va tcbhashsize
Size of the
.Tn TCP
control-block hash table
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It Va pcbcount
Number of active protocol control blocks
(read-only).
.It Va syncookies
Determines whether or not
.Tn SYN
cookies should be generated for outbound
.Tn SYN-ACK
packets.
.Tn SYN
cookies are a great help during
.Tn SYN
flood attacks, and are enabled by default.
(See
.Xr syncookies 4 . )
.It Va isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It Va rexmit_min , rexmit_slop
Adjust the retransmit timer calculation for
.Tn TCP .
The slop is
typically added to the raw calculation to take into account
occasional variances that the
.Tn SRTT
(smoothed round-trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of
.Tn TCP
RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior,
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we use 200ms of slop and a near-0
minimum, which gives us an effective minimum of 200ms (similar to
.Tn Linux ) .
.It Va inflight.enable
Enable
.Tn TCP
bandwidth-delay product limiting.
An attempt will be made to calculate
the bandwidth-delay product for each individual
.Tn TCP
connection, and limit
the amount of inflight data being transmitted, to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large
.Tn TCP
windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth-delay product limiting only affects
the transmit side of a
.Tn TCP
connection.
.It Va inflight.debug
Enable debugging for the bandwidth-delay product algorithm.
.It Va inflight.min
This puts a lower bound on the bandwidth-delay product window, in bytes.
A value of 1024 is typically used for debugging.
6000-16000 is more typical in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high effectively disables the algorithm.
.It Va inflight.max
This puts an upper bound on the bandwidth-delay product window, in bytes.
This value should not generally be modified, but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit, to smooth data flow over a
network while still being able to specify huge internal
.Tn TCP
buffers.
.It Va inflight.stab
The bandwidth-delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.
This parameter determines the extra window in maximal packets / 10.
The default value of 20 represents 2 maximal packets.
Reducing this value is not recommended, but you may
come across a situation with very slow links where the
.Xr ping 8
time
reduction of the default inflight code is not sufficient.
If this case occurs, you should first try reducing
.Va inflight.min
and, if that does not
work, reduce both
.Va inflight.min
and
.Va inflight.stab ,
trying values of
15, 10, or 5 for the latter.
Never use a value less than 5.
Reducing
.Va inflight.stab
can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.It Va rfc3042
Enable the Limited Transmit algorithm as described in RFC 3042.
It helps avoid timeouts on lossy links and also when the congestion window
is small, as happens on short transfers.
.It Va rfc3390
Enable support for RFC 3390, which allows for a variable-sized
starting congestion window on new connections, depending on the
maximum segment size.
This helps throughput in general, but
particularly affects short transfers and high-bandwidth large
propagation-delay connections.
.Pp
When this feature is enabled, the
.Va slowstart_flightsize
and
.Va local_slowstart_flightsize
settings are not observed for new
connection slow starts, but they are still used for slow starts
that occur when the connection has been idle and starts sending
again.
.It Va sack.enable
Enable support for RFC 2018, TCP Selective Acknowledgment option,
which allows the receiver to inform the sender about all successfully
arrived segments, allowing the sender to retransmit the missing segments
only.
.It Va sack.initburst
Control the number of SACK retransmissions done upon initiation of SACK
recovery.
.It Va maxtcptw
When a TCP connection enters the
.Dv TIME_WAIT
state, its associated socket structure is freed, since it is of
negligible use, and a new structure is allocated to contain the
minimal amount of information necessary for sustaining a connection in
this state, called the compressed TCP TIME_WAIT state.
Since this structure is smaller than a socket structure, it can save
a significant amount of system memory.
The
.Va net.inet.tcp.maxtcptw
MIB variable controls the maximum number of these structures allocated.
By default, it is initialized to
.Va kern.ipc.maxsockets
/ 5.
.It Va nolocaltimewait
Suppress creation of compressed TCP TIME_WAIT states for connections in
which both endpoints are local.
.El
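.Pp
As a minimal sketch, an integer-valued variable from this branch might be
read from a program as follows.
The sketch assumes a
.Fx
system that provides
.Fn sysctlbyname
(see
.Xr sysctl 3 )
and a variable of type
.Vt int ;
the helper name
.Fn tcp_mib_int
is hypothetical, and on other systems the sketch simply fails with
.Er ENOSYS .
.Bd -literal -offset indent
```c
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif
#include <errno.h>
#include <stddef.h>

/*
 * Read an integer MIB variable by name, for example
 * "net.inet.tcp.mssdflt", into *val.
 * Returns 0 on success or -1 with errno set.
 * tcp_mib_int() is a hypothetical helper used only for illustration.
 */
int
tcp_mib_int(const char *name, int *val)
{
#if defined(__FreeBSD__)
	size_t len = sizeof(*val);

	/* No new value is supplied, so this is a read-only query. */
	return (sysctlbyname(name, val, &len, NULL, 0));
#else
	/* Assumption: no sysctlbyname() here; report it as unsupported. */
	(void)name;
	(void)val;
	errno = ENOSYS;
	return (-1);
#endif
}
```
.Ed
Variables documented as read-only, such as
.Va pcbcount ,
can be queried the same way.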
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
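.Pp
As an illustration of one of these errors, the sketch below binds two
sockets to the same loopback address and port; the second
.Xr bind 2
is expected to fail with
.Er EADDRINUSE .
The helper name
.Fn second_bind_errno
is hypothetical, and the sketch assumes that neither socket has any
address-reuse option set.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Bind two TCP sockets to the same loopback address and port.
 * Returns the errno value from the second bind(2), 0 if that bind
 * unexpectedly succeeded, or -1 if the setup itself failed.
 * second_bind_errno() is a hypothetical helper for illustration.
 */
int
second_bind_errno(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s1, s2, error = -1;

	s1 = socket(AF_INET, SOCK_STREAM, 0);
	s2 = socket(AF_INET, SOCK_STREAM, 0);
	if (s1 == -1 || s2 == -1)
		goto out;
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(0);		/* system assigns a port */
	if (bind(s1, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    getsockname(s1, (struct sockaddr *)&sin, &len) == -1)
		goto out;
	/* sin now names the address and port already allocated to s1. */
	if (bind(s2, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		error = errno;
	else
		error = 0;
out:
	if (s1 != -1)
		close(s1);
	if (s2 != -1)
		close(s2);
	return (error);
}
```
.Ed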
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr syncache 4 ,
.Xr setkey 8
.Rs
.%A "V. Jacobson"
.%A "R. Braden"
.%A "D. Borman"
.%T "TCP Extensions for High Performance"
.%O "RFC 1323"
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Tn TCP
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
The
.Dv TCP_INFO
option was introduced in
.Tn Linux 2.6
and is
.Em subject to change .
565.Em subject to change .
566