.\" $OpenBSD: sort.1,v 1.31 2007/08/21 21:22:37 millert Exp $
.\" $FreeBSD$
.\"
.\" Copyright (c) 1991, 1993
.\"    The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)sort.1 8.1 (Berkeley) 6/6/93
.\"
.Dd July 3, 2012
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm sort
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm sort
.Fl Fl help
.Nm sort
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character
.Pq Fl z No option .
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1;
otherwise it returns 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted, the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file size is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalents
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating points, which have a much
more permissive format than that allowed by
.Fl n ,
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the
.Xr df 1
command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are supposed to have optional blanks at the beginning, an
optional minus sign, zero or more digits (including decimal point and
possible thousand separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
The hash function is randomized by
.Pa /dev/random
content, or by file content if it is specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Pp
Example:
.Bd -literal -offset indent
$ ls sort* | sort -V
sort-1.022.tgz
sort-1.23.tgz
sort-1.23.1.tgz
sort-1.024.tgz
sort-1.024.003.
sort-1.024.003.tgz
sort-1.024.07.tgz
sort-1.024.009.tgz
.Ed
.El
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2 , Fl Fl key Ns = Ns Ar field1 Op , Ar field2
.Sm on
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field; however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are supposed to be separated by
newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify the maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
exits with an error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, the file
.Pa /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default equals the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use the file memory mapping system call.
It may increase speed in some cases.
.El
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
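.Pp
As a rough illustration of the rule above, the following shell sketch
sorts two made-up records on their second field, first with the leading
blanks kept in the field and then with the
.Cm b
modifier.
The sample data and the variable names are assumptions chosen for this
sketch only; the C locale is forced so that the comparison is bytewise.
.Bd -literal -offset indent
```shell
# Hypothetical sample: field 2 starts with two blanks in one record, one in the other.
data='x  b
x a'

# Default: leading blanks are part of field 2, so "  b" sorts before " a".
with_blanks=$(echo "$data" | LC_ALL=C sort -k 2)

# With the b modifier the comparison starts at the first non-blank character.
without_blanks=$(echo "$data" | LC_ALL=C sort -k 2b)

echo "$with_blanks"
echo "---"
echo "$without_blanks"
```
.Ed
.Pp
The first result should list the record with two blanks first, and the
second result should list the records by the letters alone.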
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to the
.Ar field1
or
.Ar field2
where it is specified, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified with
.Ar field1 ,
.Ar field2 ,
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
%%NLS%%.It Ev NLSPATH
%%NLS%%Path to NLS catalogs.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if we specify
.Fl t
with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or, if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
extensions specific to this implementation.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v3 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than those imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest, but
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g ,
so its use is encouraged
whenever possible.
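.Pp
As a small sketch of the advice above, the following shell fragment sorts
a made-up list of plain decimal numbers with both options and compares
the results.
The sample values and variable names are assumptions for illustration;
for input this simple the two options are expected to agree, so
.Fl n
can be used without changing the result.
.Bd -literal -offset indent
```shell
# Hypothetical sample of plain decimal numbers, which both options can parse.
nums='10
9
-3
2.5'

# Arithmetic sort with the cheaper -n option.
by_n=$(echo "$nums" | LC_ALL=C sort -n)

# The same input sorted with the more general (and slower) -g option.
by_g=$(echo "$nums" | LC_ALL=C sort -g)

if [ "$by_n" = "$by_g" ]; then
	echo "same order"
fi
```
.Ed
.Pp
Input that relies on the more permissive number formats accepted by
.Fl g
may still need that option.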