.\"	$OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)sort.1	8.1 (Berkeley) 6/6/93
.\"
.Dd September 4, 2019
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit.
.It Fl Fl help
Print the help text and exit.
.El
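.Pp
As a minimal sketch of
.Fl u
(the input words are invented for illustration),
the following command is expected to print apple, fig and pear,
one per line, keeping a single copy of each duplicated line:
.Bd -literal -offset indent
```shell
# Illustrative input only; each distinct line is kept once.
sort -u <<EOF
pear
apple
pear
apple
fig
EOF
```
.Ed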
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the 'df' command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to have optional blanks at the beginning, an
optional minus sign, and zero or more digits (including a decimal point and
possible thousand separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly, seeded by the content of
.Cm /dev/random
or by the content of the file specified with
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names of the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Bl -tag -width indent
.It Example:
.It $ ls sort* | sort -V
.It sort-1.022.tgz
.It sort-1.23.tgz
.It sort-1.23.1.tgz
.It sort-1.024.tgz
.It sort-1.024.003.
.It sort-1.024.003.tgz
.It sort-1.024.07.tgz
.It sort-1.024.009.tgz
.El
.El
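.Pp
The effect of the numeric ordering options can be sketched with a few
invented sample values.
With
.Fl n ,
the first command is expected to print 9, 10 and 100 in that order;
with
.Fl h ,
the second is expected to print 512, 12345K and 1M, since the suffix is
compared before the numeric value:
.Bd -literal -offset indent
```shell
# Numeric sort: compares by arithmetic value, not character by character.
sort -n <<EOF
10
9
100
EOF

# Human-numeric sort: the suffix is compared before the numeric value.
sort -h <<EOF
1M
12345K
512
EOF
```
.Ed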
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as a field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
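.Pp
For instance, assuming colon-separated records loosely modeled on
.Xr passwd 5
entries (the three records below are hypothetical), the third field can
be selected as a numeric key.
This sketch is expected to list root first, then alice, then carol:
.Bd -literal -offset indent
```shell
# -t : sets the field separator; -k 3,3n restricts the key to field 3
# and compares it numerically.
sort -t : -k 3,3n <<EOF
carol:x:1002
root:x:0
alice:x:1001
EOF
```
.Ed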
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
must exit with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default is the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
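.Pp
A possible use of
.Fl Fl files0-from
together with
.Xr find 1
is sketched below.
The scratch directory and file names are hypothetical and exist only for
the purpose of the illustration; the command is expected to print apple
and then pear:
.Bd -literal -offset indent
```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d /tmp/sortdemo.XXXXXX)
echo pear > "$dir/a.txt"
echo apple > "$dir/b.txt"

# Build a NUL-separated list of input files, then sort their contents.
find "$dir" -name "*.txt" -print0 > "$dir/list"
sort --files0-from="$dir/list"

rm -r "$dir"
```
.Ed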
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
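.Pp
As a sketch of the
.Em m.n
notation (with made-up records), the key below covers only the fourth
and fifth characters of the first field and is compared numerically.
The leading letters therefore should not influence the result, and the
expected order is zzz07, mmm12, abc30:
.Bd -literal -offset indent
```shell
# Key = characters 4 through 5 of field 1, compared numerically (n).
sort -k 1.4,1.5n <<EOF
abc30
zzz07
mmm12
EOF
```
.Ed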
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified
with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
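.Pp
Because collation depends on the locale, the same input may be ordered
differently under different settings.
A small sketch with invented words: in the C locale comparison is by
byte value, so the capitalized word is expected to come first, giving
Banana, apple, cherry.
.Bd -literal -offset indent
```shell
# Force the C locale for this invocation only.
LC_ALL=C sort <<EOF
apple
Banana
cherry
EOF
```
.Ed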
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or, if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
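.Pp
The exit status can be tested directly from the shell.
A minimal sketch, using two deliberately unsorted lines as hypothetical
input, which is expected to take the else branch and print
.Dq not sorted :
.Bd -literal -offset indent
```shell
# sort -C prints nothing; only the exit status reports the result.
if sort -C <<EOF
b
a
EOF
then
	echo "sorted"
else
	echo "not sorted"
fi
```
.Ed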
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v1 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, then single-byte
locales follow, and multi-byte locales are the slowest, but
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged
whenever possible.
635