.\" $OpenBSD: sort.1,v 1.45 2015/03/19 13:51:10 jmc Exp $
.\" $FreeBSD$
.\"
.\" Copyright (c) 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" the Institute of Electrical and Electronics Engineers, Inc.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)sort.1 8.1 (Berkeley) 6/6/93
.\"
.Dd March 19, 2015
.Dt SORT 1
.Os
.Sh NAME
.Nm sort
.Nd sort or merge records (lines) of text and binary files
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl bcCdfghiRMmnrsuVz
.Sm off
.Op Fl k\ \& Ar field1 Op , Ar field2
.Sm on
.Op Fl S Ar memsize
.Ek
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Fl o Ar output
.Op Ar file ...
.Nm
.Fl Fl help
.Nm
.Fl Fl version
.Sh DESCRIPTION
The
.Nm
utility sorts text and binary files by lines.
A line is a record separated from the subsequent record by a
newline (default) or NUL \'\\0\' character (-z option).
A record can contain any printable or unprintable characters.
Comparisons are based on one or more sort keys extracted from
each line of input, and are performed lexicographically,
according to the current locale's collating rules and the
specified command-line options that can tune the actual
sorting behavior.
By default, if keys are not given,
.Nm
uses entire lines for comparison.
.Pp
The command line options are as follows:
.Bl -tag -width Ds
.It Fl c , Fl Fl check , Fl C , Fl Fl check=silent|quiet
Check that the single input file is sorted.
If the file is not sorted,
.Nm
produces the appropriate error messages and exits with code 1,
otherwise returns 0.
If
.Fl C
or
.Fl Fl check=silent
is specified,
.Nm
produces no output.
This is a "silent" version of
.Fl c .
.It Fl m , Fl Fl merge
Merge only.
The input files are assumed to be pre-sorted.
If they are not sorted the output order is undefined.
.It Fl o Ar output , Fl Fl output Ns = Ns Ar output
Print the output to the
.Ar output
file instead of the standard output.
.It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
Use
.Ar size
for the maximum size of the memory buffer.
Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
If a memory limit is not explicitly specified,
.Nm
takes up to about 90% of available memory.
If the file size is too big to fit into the memory buffer,
temporary disk files are used to perform the sorting.
.It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
Store temporary files in the directory
.Ar dir .
The default path is the value of the environment variable
.Ev TMPDIR
or
.Pa /var/tmp
if
.Ev TMPDIR
is not defined.
.It Fl u , Fl Fl unique
Unique keys.
Suppress all lines that have a key that is equal to an already
processed one.
This option, similarly to
.Fl s ,
implies a stable sort.
If used with
.Fl c
or
.Fl C ,
.Nm
also checks that there are no lines with duplicate keys.
.It Fl s
Stable sort.
This option maintains the original record order of records that have
an equal key.
This is a non-standard feature, but it is widely accepted and used.
.It Fl Fl version
Print the version and exit silently.
.It Fl Fl help
Print the help text and exit silently.
.El
.Pp
The following options override the default ordering rules.
When ordering options appear independently of key field
specifications, they apply globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
the ordering options override all global ordering options for
the key they are attached to.
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank characters when comparing lines.
.It Fl d , Fl Fl dictionary-order
Consider only blank spaces and alphanumeric characters in comparisons.
.It Fl f , Fl Fl ignore-case
Convert all lowercase characters to their uppercase equivalent
before comparison, that is, perform case-independent sorting.
.It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
Sort by general numerical value.
As opposed to
.Fl n ,
this option handles general floating-point numbers.
It has a more
permissive format than that allowed by
.Fl n
but it has a significant performance drawback.
.It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
Sort by numerical value, but take into account the SI suffix,
if present.
Sort first by numeric sign (negative, zero, or
positive); then by SI suffix (either empty, or `k' or `K', or one
of `MGTPEZY', in that order); and finally by numeric value.
The SI suffix must immediately follow the number.
For example, '12345K' sorts before '1M', because M is "larger" than K.
This sort option is useful for sorting the output of a single invocation
of the df command with the
.Fl h
or
.Fl H
options (human-readable).
.It Fl i , Fl Fl ignore-nonprinting
Ignore all non-printable characters.
.It Fl M , Fl Fl month-sort , Fl Fl sort=month
Sort by month abbreviations.
Unknown strings are considered smaller than the month names.
.It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
Sort fields numerically by arithmetic value.
Fields are expected to consist of optional leading blanks, an
optional minus sign, and zero or more digits (possibly including a
decimal point and thousands separators).
.It Fl R , Fl Fl random-sort , Fl Fl sort=random
Sort in random order.
This is a random permutation of the inputs except that
equal keys sort together.
It is implemented by hashing the input keys and sorting
the hash values.
The hash function is chosen randomly.
The hash function is randomized by
.Cm /dev/random
content, or by file content if it is specified by
.Fl Fl random-source .
Even if multiple sort fields are specified,
the same random hash function is used for all of them.
.It Fl r , Fl Fl reverse
Sort in reverse order.
.It Fl V , Fl Fl version-sort
Sort version numbers.
The input lines are treated as file names in the form
PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
The files are compared by their prefixes and versions (leading
zeros are ignored in version numbers, see example below).
If an input string does not match the pattern, then it is compared
using the byte compare function.
All string comparisons are performed in the C locale; the locale
environment setting is ignored.
.Bl -tag -width indent
.It Example:
.It $ ls sort* | sort -V
.It sort-1.022.tgz
.It sort-1.23.tgz
.It sort-1.23.1.tgz
.It sort-1.024.tgz
.It sort-1.024.003.
.It sort-1.024.003.tgz
.It sort-1.024.07.tgz
.It sort-1.024.009.tgz
.El
.El
.Pp
The treatment of field separators can be altered using these options:
.Bl -tag -width indent
.It Fl b , Fl Fl ignore-leading-blanks
Ignore leading blank space when determining the start
and end of a restricted sort key (see
.Fl k ) .
If
.Fl b
is specified before the first
.Fl k
option, it applies globally to all key specifications.
Otherwise,
.Fl b
can be attached independently to each
.Ar field
argument of the key specifications.
.It Xo
.Fl k Ar field1 Ns Op , Ns Ar field2 ,
.Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
.Xc
Define a restricted sort key that has the starting position
.Ar field1 ,
and optional ending position
.Ar field2
of a key field.
The
.Fl k
option may be specified multiple times,
in which case subsequent keys are compared when earlier keys compare equal.
The
.Fl k
option replaces the obsolete options
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2 ,
but the old notation is also supported.
.It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
Use
.Ar char
as a field separator character.
The initial
.Ar char
is not considered to be part of a field when determining key offsets.
Each occurrence of
.Ar char
is significant (for example,
.Dq Ar charchar
delimits an empty field).
If
.Fl t
is not specified, the default field separator is a sequence of
blank space characters, and consecutive blank spaces do
.Em not
delimit an empty field, however, the initial blank space
.Em is
considered part of a field when determining key offsets.
To use NUL as field separator, use
.Fl t
\'\\0\'.
.It Fl z , Fl Fl zero-terminated
Use NUL as record separator.
By default, records in the files are supposed to be separated by
the newline characters.
With this option, NUL (\'\\0\') is used as a record separator character.
.El
.Pp
Other options:
.Bl -tag -width indent
.It Fl Fl batch-size Ns = Ns Ar num
Specify maximum number of files that can be opened by
.Nm
at once.
This option affects behavior when having many input files or using
temporary files.
The default value is 16.
.It Fl Fl compress-program Ns = Ns Ar PROGRAM
Use PROGRAM to compress temporary files.
PROGRAM must compress standard input to standard output, when called
without arguments.
When called with argument
.Fl d
it must decompress standard input to standard output.
If PROGRAM fails,
.Nm
must exit with error.
An example of PROGRAM that can be used here is bzip2.
.It Fl Fl random-source Ns = Ns Ar filename
In random sort, the file content is used as the source of the 'seed' data
for the hash function choice.
Two invocations of random sort with the same seed data will use
the same hash function and will produce the same result if the input is
also identical.
By default, file
.Cm /dev/random
is used.
.It Fl Fl debug
Print some extra information about the sorting process to the
standard output.
%%THREADS%%.It Fl Fl parallel
%%THREADS%%Set the maximum number of execution threads.
%%THREADS%%The default is the number of CPUs.
.It Fl Fl files0-from Ns = Ns Ar filename
Take the input file list from the file
.Ar filename .
The file names must be separated by NUL
(like the output produced by the command "find ... -print0").
.It Fl Fl radixsort
Try to use radix sort, if the sort specifications allow.
The radix sort can only be used for trivial locales (C and POSIX),
and it cannot be used for numeric or month sort.
Radix sort is very fast and stable.
.It Fl Fl mergesort
Use mergesort.
This is a universal algorithm that can always be used,
but it is not always the fastest.
.It Fl Fl qsort
Try to use quick sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl heapsort
Try to use heap sort, if the sort specifications allow.
This sort algorithm cannot be used with
.Fl u
and
.Fl s .
.It Fl Fl mmap
Try to use file memory mapping system call.
It may increase speed in some cases.
.El
.Pp
The following operands are available:
.Bl -tag -width indent
.It Ar file
The pathname of a file to be sorted, merged, or checked.
If no
.Ar file
operands are specified, or if a
.Ar file
operand is
.Fl ,
the standard input is used.
.El
.Pp
A field is defined as a maximal sequence of characters other than the
field separator and record separator (newline by default).
Initial blank spaces are included in the field unless
.Fl b
has been specified;
the first blank space of a sequence of blank spaces acts as the field
separator and is included in the field (unless
.Fl t
is specified).
For example, all blank spaces at the beginning of a line are
considered to be part of the first field.
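.Pp
As a minimal sketch of the rule above (the two sample records and the
variable names are hypothetical), the following compares a key that
includes the leading blanks of the second field with one that skips
them by means of the
.Cm b
modifier:

```shell
# Two hypothetical records whose second fields differ only after the blanks.
input=$(printf 'a  z\na y\n')

# Without b, the blanks count as part of field 2 and take part in the comparison.
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2,2)

# With the b modifier, the key starts at the first non-blank character.
without_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2b,2)

printf '%s\n---\n%s\n' "$with_blanks" "$without_blanks"
```

In the C locale the first invocation is expected to list the record
with two blanks first, because a blank collates before a letter,
while the second is expected to list it last.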
.Pp
Fields are specified by the
.Sm off
.Fl k\ \& Ar field1 Op , Ar field2
.Sm on
command-line option.
If
.Ar field2
is missing, the end of the key defaults to the end of the line.
.Pp
The arguments
.Ar field1
and
.Ar field2
have the form
.Em m.n
.Em (m,n > 0)
and can be followed by one or more of the modifiers
.Cm b , d , f , i ,
.Cm n , g , M
and
.Cm r ,
which correspond to the options discussed above.
When
.Cm b
is specified, it applies only to the
.Ar field1
or
.Ar field2
to which it is attached, while the rest of the modifiers
apply to the whole key field regardless of whether they are
specified with
.Ar field1 ,
.Ar field2 ,
or both.
A
.Ar field1
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
in
.Ar field1
means
.Ql \&.1 ,
indicating the first character of the
.Em m Ns th
field; if the
.Fl b
option is in effect,
.Em n
is counted from the first non-blank character in the
.Em m Ns th
field;
.Em m Ns \&.1b
refers to the first non-blank character in the
.Em m Ns th
field.
.No 1\&. Ns Em n
refers to the
.Em n Ns th
character from the beginning of the line;
if
.Em n
is greater than the length of the line, the field is taken to be empty.
.Pp
.Em n Ns th
positions are always counted from the field beginning, even if the field
is shorter than the number of specified positions.
Thus, the key can really start from a position in a subsequent field.
.Pp
A
.Ar field2
position specified by
.Em m.n
is interpreted as the
.Em n Ns th
character (including separators) from the beginning of the
.Em m Ns th
field.
A missing
.Em \&.n
indicates the last character of the
.Em m Ns th
field;
.Em m
= \&0
designates the end of a line.
Thus the option
.Fl k Ar v.x,w.y
is synonymous with the obsolete option
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w-\&1.y ;
when
.Em y
is omitted,
.Fl k Ar v.x,w
is synonymous with
.Cm \(pl Ns Ar v-\&1.x-\&1
.Fl Ns Ar w\&.0 .
The obsolete
.Cm \(pl Ns Ar pos1
.Fl Ns Ar pos2
option is still supported, except for
.Fl Ns Ar w\&.0b ,
which has no
.Fl k
equivalent.
.Sh ENVIRONMENT
.Bl -tag -width Fl
.It Ev LC_COLLATE
Locale settings to be used to determine the collation for
sorting records.
.It Ev LC_CTYPE
Locale settings to be used for case conversion and classification
of characters, that is, which characters are considered
whitespace, etc.
.It Ev LC_MESSAGES
Locale settings that determine the language of output messages
that
.Nm
prints out.
.It Ev LC_NUMERIC
Locale settings that determine the number format used in numeric sort.
.It Ev LC_TIME
Locale settings that determine the month format used in month sort.
.It Ev LC_ALL
Locale settings that override all of the above locale settings.
This environment variable can be used to set all these settings
to the same value at once.
.It Ev LANG
Used as a last resort to determine different kinds of locale-specific
behavior if neither the respective environment variable nor
.Ev LC_ALL
is set.
%%NLS%%.It Ev NLSPATH
%%NLS%%Path to NLS catalogs.
.It Ev TMPDIR
Path to the directory in which temporary files will be stored.
Note that
.Ev TMPDIR
may be overridden by the
.Fl T
option.
.It Ev GNUSORT_NUMERIC_COMPATIBILITY
If defined,
.Fl t
will not override the locale numeric symbols, that is, thousand
separators and decimal separators.
By default, if
.Fl t
is specified with the same symbol as the thousand separator or decimal point,
the symbol will be treated as the field separator.
Older behavior was less definite; the symbol was treated as both field
separator and numeric separator, simultaneously.
This environment variable enables the old behavior.
.El
.Sh FILES
.Bl -tag -width Pa -compact
.It Pa /var/tmp/.bsdsort.PID.*
Temporary files.
.It Pa /dev/random
Default seed file for the random sort.
.El
.Sh EXIT STATUS
The
.Nm
utility shall exit with one of the following values:
.Pp
.Bl -tag -width flag -compact
.It 0
Successfully sorted the input files or if used with
.Fl c
or
.Fl C ,
the input file already met the sorting criteria.
.It 1
On disorder (or non-uniqueness) with the
.Fl c
or
.Fl C
options.
.It 2
An error occurred.
.El
.Sh SEE ALSO
.Xr comm 1 ,
.Xr join 1 ,
.Xr uniq 1
.Sh STANDARDS
The
.Nm
utility is compliant with the
.St -p1003.1-2008
specification.
.Pp
The flags
.Op Fl ghRMSsTVz
are extensions to the POSIX specification.
.Pp
All long options are extensions to the specification; some of them are
provided for compatibility with GNU versions and some of them are
local extensions.
.Pp
The old key notations
.Cm \(pl Ns Ar pos1
and
.Fl Ns Ar pos2
come from older versions of
.Nm
and are still supported but their use is highly discouraged.
.Sh HISTORY
A
.Nm
command first appeared in
.At v3 .
.Sh AUTHORS
.An Gabor Kovesdan Aq Mt gabor@FreeBSD.org ,
.Pp
.An Oleg Moskalenko Aq Mt mom040267@gmail.com
.Sh NOTES
This implementation of
.Nm
has no limits on input line length (other than imposed by available
memory) or any restrictions on bytes allowed within lines.
.Pp
The performance depends highly on locale settings,
efficient choice of sort keys and key complexity.
The fastest sort is with locale C, on whole lines,
with option
.Fl s .
In general, locale C is the fastest, followed by single-byte
locales, with multi-byte locales the slowest;
the correct collation order is always respected.
As for the key specification, the simpler the lines are to
process, the faster the sort will be.
.Pp
When sorting by arithmetic value, using
.Fl n
results in much better performance than
.Fl g ,
so its use is encouraged
whenever possible.
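.Sh EXAMPLES
The following is a minimal sketch of the key syntax and exit status
described in this manual, not output captured from this implementation.
The sample records (name:shell:uid) and the variable names are hypothetical.
It sorts colon-separated records numerically on the third field,
breaks ties on the first field, and then uses
.Fl c
to show the exit status for unsorted input.

```shell
# Hypothetical sample data in the form name:shell:uid.
data=$(printf 'carol:sh:1002\nalice:csh:1000\nbob:sh:1000\n')

# Primary key: field 3 compared numerically; ties are broken by field 1.
by_uid=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3n -k1,1)
printf '%s\n' "$by_uid"

# -c only checks the order; the data above is not sorted on field 1.
if printf '%s\n' "$data" | LC_ALL=C sort -c -t: -k1,1 2>/dev/null; then
    check_status=0
else
    check_status=$?
fi
printf 'check status: %s\n' "$check_status"
```

LC_ALL=C is set here only to make the byte ordering predictable;
other locales may collate the name field differently.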