.\"	$OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\"
.\" Copyright (c) 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd November 30, 2023
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file size is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
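.Pp
The options above can be combined as in the following minimal sketch.
The scratch directory and the file names in it are invented for the
demonstration, and the locale is pinned to C so that the order is predictable.

```shell
# Scratch directory and file names are made up for this demonstration.
dir=$(mktemp -d)
printf 'pear\napple\napple\nfig\n' > "$dir/fruit.txt"
printf 'banana\nkiwi\n' > "$dir/more.txt"

# -u drops lines whose key repeats; -o writes to a file instead of stdout.
LC_ALL=C sort -u -o "$dir/sorted.txt" "$dir/fruit.txt"
unique=$(cat "$dir/sorted.txt")

# -m merges inputs that are already sorted.
merged=$(LC_ALL=C sort -m "$dir/sorted.txt" "$dir/more.txt")

# -s keeps records with equal keys in input order: "b 1" stays ahead of "a 1".
stable=$(printf 'b 1\na 1\nc 0\n' | LC_ALL=C sort -s -k 2,2n)

rm -rf "$dir"
```

Without
.Fl s ,
the two records with key 1 may be reordered by a final whole-line comparison.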
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n ,
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to consist of optional leading blanks, an
optional minus sign, and zero or more digits (possibly including a
decimal point and thousand separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in a random order.
This is a random permutation of the inputs except that
the equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly; it is seeded from
.Cm /dev/random
content, or from file content if a file is specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Pp
Example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
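.Pp
As a rough illustration of how the ordering options change the result,
the sketch below sorts the same small inputs in several ways.
The sample values are invented, and the C locale is assumed throughout.

```shell
numbers='10
9
100'

# Lexicographic order compares character by character, so "100" precedes "9".
lexical=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# -n compares by arithmetic value; adding -r reverses that order.
numeric=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)
reversed=$(printf '%s\n' "$numbers" | LC_ALL=C sort -nr)

# -h ranks the suffix before the number, so 12345K sorts before 1M.
human=$(printf '1M\n12345K\n2G\n512\n' | LC_ALL=C sort -h)

# -M orders month name abbreviations by calendar position.
months=$(printf 'Mar\nJan\nFeb\n' | LC_ALL=C sort -M)
```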
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as the field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
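.Pp
The interaction of
.Fl t
and
.Fl k
might be sketched as follows.
The colon-separated records are made-up sample data, not the contents
of any real system file.

```shell
accounts='carol:x:1002:/home/carol
alice:x:1000:/home/alice
bob:x:1001:/home/bob'

# Restrict the key to the third colon-separated field and compare it numerically.
by_uid=$(printf '%s\n' "$accounts" | LC_ALL=C sort -t : -k 3,3n)

# Every separator counts: "y::2" has an empty second field, which sorts first.
empty_first=$(printf 'x:b:1\ny::2\n' | LC_ALL=C sort -t : -k 2,2)

# A second -k is consulted only when the first key compares equal.
two_keys=$(printf 'b 2\na 2\na 1\n' | LC_ALL=C sort -k 2,2n -k 1,1)
```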
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
exits with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default is the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use the file memory-mapping system call.
It may increase speed in some cases.
.El
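.Pp
Two of these options lend themselves to a short sketch: reading the
input list from a NUL-separated file, and making a random sort repeatable.
The scratch directory, the file names and the all-zero seed file are
assumptions chosen for the demonstration only.

```shell
# Scratch directory and file names are made up for this demonstration.
dir=$(mktemp -d)
printf 'b\nd\n' > "$dir/one.txt"
printf 'a\nc\n' > "$dir/two.txt"

# Build a NUL-separated list of inputs, then sort their combined contents.
find "$dir" -name '*.txt' -print0 > "$dir/list"
combined=$(LC_ALL=C sort --files0-from="$dir/list")

# A fixed seed file makes two random sorts of the same input agree.
head -c 1024 /dev/zero > "$dir/seed"
first=$(printf '1\n2\n3\n4\n5\n' | sort -R --random-source="$dir/seed")
second=$(printf '1\n2\n3\n4\n5\n' | sort -R --random-source="$dir/seed")

rm -rf "$dir"
```

The particular permutation produced depends on the implementation;
only the agreement between the two runs is expected.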
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to the
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
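.Pp
The character offsets and the
.Cm b
modifier described above might be exercised as in this sketch.
The records are invented sample data and the C locale is assumed.

```shell
records='mm-07 late
aa-15 last
zz-02 early'

# The key runs from character 4 to character 5 of field 1, i.e. the two digits.
by_digits=$(printf '%s\n' "$records" | LC_ALL=C sort -k 1.4,1.5n)

padded='x   b
y a'

# Without b, the blanks before the second field are part of the key,
# so "   b" sorts ahead of " a".
with_blanks=$(printf '%s\n' "$padded" | LC_ALL=C sort -k 2,2)

# With b on the start position, the key begins at the first non-blank character.
without_blanks=$(printf '%s\n' "$padded" | LC_ALL=C sort -k 2b,2)
```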
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
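.Pp
Because the collation order follows the locale, setting
.Ev LC_ALL
for a single invocation is a common way to get byte-wise ordering.
The sketch below only asserts the C locale result; the order under
other locales depends on the locale data installed on the system.

```shell
words='b
A
a
B'

# In the C locale all uppercase letters precede all lowercase letters.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)

# -f folds case for the comparison; records whose folded keys are equal are
# then ordered by a final whole-line comparison, so "A" precedes "a".
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f)
```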
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or, if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
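.Pp
A script can branch on these values, as in the following sketch.
The input path in the last command is a hypothetical name that is
assumed not to exist.

```shell
# Input already in order: status 0.
printf 'a\nb\n' | LC_ALL=C sort -c && in_order=0 || in_order=$?

# Input out of order: status 1 (the diagnostic is discarded here).
printf 'b\na\n' | LC_ALL=C sort -c 2>/dev/null && disorder=0 || disorder=$?

# With -u, a repeated key also counts as a failed check: status 1.
printf 'a\na\n' | LC_ALL=C sort -c -u 2>/dev/null && duplicate=0 || duplicate=$?

# An unreadable input is an error: status 2.
LC_ALL=C sort /nonexistent/input 2>/dev/null && failure=0 || failure=$?
```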
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and others are
extensions specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v1 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest;
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged
whenever possible.
632whenever possible.
633