.\"	$OpenBSD: sort.1,v 1.31 2007/08/21 21:22:37 millert Exp $
.\"	$FreeBSD$
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)sort.1	8.1 (Berkeley) 6/6/93
.\"
.Dd July 3, 2012
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm sort
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm sort
.Fl Fl help
.Nm sort
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (the default) or a NUL (\(aq\e0\(aq) character (with the
.Fl z
option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
The size modifiers %, b, K, M, G, T, P, E, Z and Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the input is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and silently exit.
.It Fl Fl help
Print the help text and silently exit.
.El
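.Pp
As a brief illustration of the options above, the following sketch checks
whether a file is sorted and then writes a sorted copy without duplicate
lines.
The file names
.Pa fruits.txt
and
.Pa fruits.sorted
are hypothetical.
.Bd -literal -offset indent
# Create a small unsorted sample file.
{ echo pear; echo apple; echo pear; } > fruits.txt
# -C exits with a non-zero status because the file is not sorted.
sort -C fruits.txt || echo "not sorted"
# Write the sorted lines, without duplicates, to fruits.sorted.
sort -u -o fruits.sorted fruits.txt
.Ed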
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers, which have a much
more permissive format than that allowed by
.Fl n ,
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to consist of optional leading blanks, an
optional minus sign, and zero or more digits (including a decimal point
and possible thousands separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in a random order.
This is a random permutation of the inputs, except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
It is randomized using the content of
.Pa /dev/random ,
or the content of the file specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names of the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see the example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Pp
Example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
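.Pp
The effect of
.Fl h
can be seen in a small sketch; the sample values are made up for
illustration.
Because the suffix is compared before the numeric value,
.Ql 12345K
sorts before
.Ql 1M ,
and a value without a suffix sorts before both:
.Bd -literal -offset indent
# Expected order: 512, 12345K, 1M
{ echo 1M; echo 12345K; echo 512; } | sort -h
.Ed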
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2 , Fl Fl key Ns = Ns Ar field1 Op , Ar field2
.Sm on
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as the field separator, use
.Fl t No \(aq\e0\(aq .
.It Fl z , Fl Fl zero-terminated
Use NUL as the record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\(aq\e0\(aq) is used as the record separator character.
.El
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use
.Ar PROGRAM
to compress temporary files.
.Ar PROGRAM
must compress standard input to standard output when called
without arguments.
When called with the argument
.Fl d ,
it must decompress standard input to standard output.
If
.Ar PROGRAM
fails,
.Nm
exits with an error.
An example of a
.Ar PROGRAM
that can be used here is
.Xr bzip2 1 .
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, the file
.Pa /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default number is equal to the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
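.Pp
As an illustration of
.Fl Fl files0-from ,
the sketch below sorts the lines of all files named in a NUL-separated list.
The directory
.Pa logs
and the file names are hypothetical.
.Bd -literal -offset indent
mkdir -p logs
{ echo delta; echo alpha; } > logs/a.log
{ echo charlie; echo bravo; } > logs/b.log
# Build a NUL-separated list of input files.
find logs -name "*.log" -print0 > list0
# Sort the contents of every listed file together.
sort --files0-from=list0
.Ed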
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
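.Pp
The key syntax can be illustrated with a short sketch.
The sample data and the file name
.Pa people.txt
are hypothetical; fields are separated by colons.
.Bd -literal -offset indent
{ echo carol:30:nyc; echo alice:4:sfo; echo bob:30:ams; } > people.txt
# Sort numerically on the second field only, in reverse order;
# records with equal second fields are ordered by the first field.
sort -t : -k 2,2nr -k 1,1 people.txt
# Sort on the second and third characters of the third field.
sort -t : -k 3.2,3.3 people.txt
.Ed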
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
%%NLS%%.It Ev NLSPATH
%%NLS%%Path to NLS catalogs.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
The input files were successfully sorted or, if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification.
Some of them are provided for compatibility with GNU versions and
some are extensions specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported, but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v3 .
.Sh AUTHORS
Gabor Kovesdan <gabor@FreeBSD.org>,
.Pp
Oleg Moskalenko <mom040267@gmail.com>
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales being the slowest, but
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g ,
so its use is encouraged
whenever possible.