.\"	$OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\"	$FreeBSD$
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)sort.1	8.1 (Berkeley) 6/6/93
.\"
.Dd September 4, 2019
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the input is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
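.Pp
The effect of
.Fl s
and
.Fl u
on records with equal keys can be sketched as follows.
The sample records and the shell variable names are invented for
illustration, and the C locale is assumed.
.Bd -literal -offset indent
```shell
# Hypothetical sample data: a key in field 1, a payload in field 2.
data=$(mktemp)
cat > "$data" <<'EOF'
b 2
a 1
b 1
a 2
EOF

# Stable sort on field 1 only: records with equal keys keep input order.
stable=$(LC_ALL=C sort -s -k 1,1 "$data")
echo "$stable"

# Unique on field 1 only: one record survives per distinct key.
unique_count=$(LC_ALL=C sort -u -k 1,1 "$data" | wc -l)
echo "$unique_count"

rm -f "$data"
```
.Ed
.Pp
With the stable sort, the two records keyed
.Ql a
and the two records keyed
.Ql b
each appear in their original relative order.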
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
option (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to consist of optional leading blanks, an
optional minus sign, and zero or more digits (possibly including a
decimal point and thousands separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
It is seeded from the content of
.Cm /dev/random ,
or from the content of the file specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Pp
Example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
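.Pp
As a minimal sketch of how an ordering option changes the result,
the following compares the default lexicographic order with
.Fl n
and with
.Fl n
combined with
.Fl r .
The sample numbers and variable names are invented for illustration,
and the C locale is assumed.
.Bd -literal -offset indent
```shell
# Hypothetical sample data: one integer per line.
nums=$(mktemp)
cat > "$nums" <<'EOF'
10
9
100
EOF

# Default comparison is lexicographic, so "10" and "100" precede "9".
lexical=$(LC_ALL=C sort "$nums")
echo "$lexical"

# -n compares by arithmetic value; adding -r reverses that order.
numeric=$(LC_ALL=C sort -n "$nums")
numeric_reversed=$(LC_ALL=C sort -n -r "$nums")
echo "$numeric"
echo "$numeric_reversed"

rm -f "$nums"
```
.Ed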
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
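.Pp
The following sketch shows that every occurrence of the
.Fl t
separator is significant, so two adjacent separators delimit an
empty field.
The colon-separated records and the variable names are invented for
illustration, and the C locale is assumed.
.Bd -literal -offset indent
```shell
# Hypothetical sample data: name:tag:number, with an empty tag for carol.
recs=$(mktemp)
cat > "$recs" <<'EOF'
carol::30
alice:x:10
bob:y:20
EOF

# Field 3 is the number for every record, including the one whose
# second field is empty, so a numeric key on field 3 orders 10, 20, 30.
by_number=$(LC_ALL=C sort -t : -k 3,3n "$recs")
echo "$by_number"

# Keyed on field 2, the empty field compares lowest and sorts first.
by_tag=$(LC_ALL=C sort -t : -k 2,2 "$recs")
echo "$by_tag"

rm -f "$recs"
```
.Ed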
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
exits with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default equals the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified it applies only to
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
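.Pp
The
.Em m.n
notation described above can be sketched with a key restricted to a
character range inside a field.
The sample records and variable names are invented for illustration,
and the C locale is assumed.
.Bd -literal -offset indent
```shell
# Hypothetical sample data: two letters followed by two digits.
codes=$(mktemp)
cat > "$codes" <<'EOF'
ab10
cd05
ef07
EOF

# The key runs from character 3 to character 4 of field 1, so only
# the digits "10", "05" and "07" take part in the comparison.
by_digits=$(LC_ALL=C sort -k 1.3,1.4 "$codes")
echo "$by_digits"

rm -f "$codes"
```
.Ed
.Pp
Because the leading letters are outside the key, the record
beginning with
.Ql ab
sorts last here even though it would sort first on whole lines.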
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
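.Pp
A script can act on these values by using
.Fl C ,
which reports only through the exit status.
The following is a minimal sketch; the helper function
.Li check_status ,
the sample files and the variable names are invented for illustration,
and the C locale is assumed.
.Bd -literal -offset indent
```shell
# Hypothetical helper: print the exit status of a silent order check.
check_status() {
    if LC_ALL=C sort -C "$1"; then
        echo 0
    else
        echo $?
    fi
}

# Hypothetical sample files, one already sorted and one not.
sorted=$(mktemp)
unsorted=$(mktemp)
cat > "$sorted" <<'EOF'
a
b
EOF
cat > "$unsorted" <<'EOF'
b
a
EOF

status_sorted=$(check_status "$sorted")
status_unsorted=$(check_status "$unsorted")
echo "$status_sorted $status_unsorted"

rm -f "$sorted" "$unsorted"
```
.Ed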
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
extensions specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v1 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest;
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged
whenever possible.