.\" $OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\"
.\" Copyright (c) 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)sort.1 8.1 (Berkeley) 6/6/93
.\"
.Dd September 4, 2019
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it exits with code 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file size is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to consist of optional leading blanks, an
optional minus sign, and zero or more digits (including a decimal point
and possible thousands separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
The hash function is randomized by
.Cm /dev/random
content, or by file content if it is specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Bl -tag -width indent
.It Example:
.It $ ls sort* | sort -V
.It sort-1.022.tgz
.It sort-1.23.tgz
.It sort-1.23.1.tgz
.It sort-1.024.tgz
.It sort-1.024.003.
.It sort-1.024.003.tgz
.It sort-1.024.07.tgz
.It sort-1.024.009.tgz
.El
.El
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as the field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are expected to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when there are many input files or when
temporary files are used.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
exits with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default equals the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
or
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
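.Pp
As an illustrative sketch of the rule above (the sample input is made up,
and the C locale is assumed so that comparison is bytewise), the second
field of each line below keeps its leading blanks.
The line with two blanks before
.Ql b
is therefore expected to sort first in the first command, while adding
.Fl b
in the second command is expected to compare
.Ql b
against
.Ql a
and reverse the order:
.Bd -literal -offset indent
$ printf 'x  b\enx a\en' | LC_ALL=C sort -k2,2
$ printf 'x  b\enx a\en' | LC_ALL=C sort -b -k2,2
.Ed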
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to the
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified only with
.Ar field1
or
.Ar field2
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or, if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
extensions specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported, but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v1 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest, but
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g
so its use is encouraged
whenever possible.
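.Pp
As a small sketch of when
.Fl g
is actually needed (the sample values are made up), consider input that
contains exponent notation.
With
.Fl n
the value
.Ql 1e3
is expected to be read as 1, because parsing stops at the first character
that is not part of a plain decimal number, while
.Fl g
is expected to read it as 1000:
.Bd -literal -offset indent
$ printf '5\en1e3\en' | sort -n
$ printf '5\en1e3\en' | sort -g
.Ed
.Pp
If the input holds only plain decimal numbers, both options should give
the same order, so the faster
.Fl n
is the natural choice.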