.\" $Id: mandoc.3,v 1.46 2025/02/25 17:03:54 schwarze Exp $
.\"
.\" Copyright (c) 2009, 2010, 2011 Kristaps Dzonsons <kristaps@bsd.lv>
.\" Copyright (c) 2010-2017 Ingo Schwarze <schwarze@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: February 25 2025 $
.Dt MANDOC 3
.Os
.Sh NAME
.Nm mandoc ,
.Nm deroff ,
.Nm mparse_alloc ,
.Nm mparse_copy ,
.Nm mparse_free ,
.Nm mparse_open ,
.Nm mparse_readfd ,
.Nm mparse_reset ,
.Nm mparse_result
.Nd mandoc macro compiler library
.Sh SYNOPSIS
.In sys/types.h
.In stdio.h
.In mandoc.h
.In roff.h
.In mandoc_parse.h
.Pp
.Fd "#define ASCII_NBRSP"
.Fd "#define ASCII_HYPH"
.Fd "#define ASCII_BREAK"
.Ft struct mparse *
.Fo mparse_alloc
.Fa "int options"
.Fa "enum mandoc_os os_e"
.Fa "char *os_s"
.Fc
.Ft void
.Fo mparse_free
.Fa "struct mparse *parse"
.Fc
.Ft void
.Fo mparse_copy
.Fa "const struct mparse *parse"
.Fc
.Ft int
.Fo mparse_open
.Fa "struct mparse *parse"
.Fa "const char *fname"
.Fc
.Ft void
.Fo mparse_readfd
.Fa "struct mparse *parse"
.Fa "int fd"
.Fa "const char *fname"
.Fc
.Ft void
.Fo mparse_reset
.Fa "struct mparse *parse"
.Fc
.Ft struct roff_meta *
.Fo mparse_result
.Fa "struct mparse *parse"
.Fc
.In roff.h
.Ft void
.Fo deroff
.Fa "char **dest"
.Fa "const struct roff_node *node"
.Fc
.In sys/types.h
.In mandoc.h
.In mdoc.h
.Vt extern const char * const * mdoc_argnames;
.Vt extern const char * const * mdoc_macronames;
.In sys/types.h
.In mandoc.h
.In man.h
.Vt extern const char * const * man_macronames;
.Sh DESCRIPTION
The
.Nm mandoc
library parses a
.Ux
manual into an abstract syntax tree (AST).
.Ux
manuals are composed of
.Xr mdoc 7
or
.Xr man 7 ,
and may be mixed with
.Xr roff 7 ,
.Xr tbl 7 ,
and
.Xr eqn 7
invocations.
.Pp
The following describes a general parse sequence:
.Bl -enum
.It
initiate a parsing sequence with
.Xr mchars_alloc 3
and
.Fn mparse_alloc ;
.It
open a file with
.Xr open 2
or
.Fn mparse_open ;
.It
parse it with
.Fn mparse_readfd ;
.It
close it with
.Xr close 2 ;
.It
retrieve the syntax tree with
.Fn mparse_result ;
.It
if information about the validity of the input is needed, fetch it with
.Fn mparse_updaterc ;
.It
iterate over parse nodes, starting from the
.Fa first
member of the returned
.Vt struct roff_meta ;
.It
free all allocated memory with
.Fn mparse_free
and
.Xr mchars_free 3 ,
or invoke
.Fn mparse_reset
and go back to step 2 to parse new files.
.El
.Pp
The design goals of the
.Nm mandoc
library are limited to providing the functionality required by the
.Xr mandoc 1
program.
Consequently, the functions documented in the present manual page
do not aim for API stability.
Any third-party program using them typically requires adjustments after every
.Nm mandoc
release.
Linking such a program requires
.Fl lz
because
.Fn mparse_readfd
calls
.Xr gzdopen 3 ,
.Xr gzread 3 ,
.Xr gzerror 3 ,
and
.Xr gzclose 3 .
For
.Xr mandoc 1
itself, the
.Pa ./configure
script automatically adds
.Fl lz
to the
.Ev LDADD
.Xr make 1
variable.
.Sh REFERENCE
This section documents the functions, types, and variables available
via
.In mandoc.h ,
with the exception of those documented in
.Xr mandoc_escape 3
and
.Xr mchars_alloc 3 .
.Ss Types
.Bl -ohang
.It Vt "enum mandocerr"
An error or warning message during parsing.
.It Vt "enum mandoclevel"
A classification of an
.Vt "enum mandocerr"
as regards system operation.
See the DIAGNOSTICS section in
.Xr mandoc 1
regarding the meanings of the levels.
.It Vt "struct mparse"
An opaque pointer to a running parse sequence.
Created with
.Fn mparse_alloc
and freed with
.Fn mparse_free .
This may be used across parsed input if
.Fn mparse_reset
is called between parses.
.El
.Ss Functions
.Bl -ohang
.It Fn deroff
Obtain a text-only representation of a
.Vt struct roff_node ,
including text contained in its child nodes.
To be used on children of the
.Fa first
member of
.Vt struct roff_meta .
When it is no longer needed, the pointer that
.Fn deroff
stores in
.Fa *dest
can be passed to
.Xr free 3 .
.It Fn mparse_alloc
Allocate a parser.
The arguments have the following effect:
.Bl -tag -offset 5n -width inttype
.It Ar options
When the
.Dv MPARSE_MDOC
or
.Dv MPARSE_MAN
bit is set, only that parser is used.
Otherwise, the document type is automatically detected.
.Pp
When the
.Dv MPARSE_SO
bit is set,
.Xr roff 7
.Ic \&so
file inclusion requests are always honoured.
Otherwise, if the request is the only content in an input file,
only the file name is remembered, to be returned in the
.Fa sodest
field of
.Vt struct roff_meta .
.Pp
When the
.Dv MPARSE_QUICK
bit is set, parsing is aborted after the NAME section.
This is useful, for example, in
.Xr makewhatis 8
.Fl Q
to quickly build minimal databases.
.Pp
When the
.Dv MPARSE_VALIDATE
bit is set,
.Fn mparse_result
runs the validation functions before returning the syntax tree.
This is almost always required, except in certain debugging scenarios,
for example to dump unvalidated syntax trees.
.It Ar os_e
Operating system to check base system conventions for.
If
.Dv MANDOC_OS_OTHER ,
the system is automatically detected from
.Ic \&Os ,
.Fl Ios ,
or
.Xr uname 3 .
.It Ar os_s
A default string for the
.Xr mdoc 7
.Ic \&Os
macro, overriding the
.Dv OSNAME
preprocessor definition and the results of
.Xr uname 3 .
Passing
.Dv NULL
sets no default.
.El
.Pp
The same parser may be used for multiple files so long as
.Fn mparse_reset
is called between parses.
.Fn mparse_free
must be called to free the memory allocated by this function.
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_free
Free all memory allocated by
.Fn mparse_alloc .
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_copy
Dump a copy of the input to the standard output; used for
.Fl man T Ns Cm man .
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_open
Open the file for reading.
If that fails and
.Fa fname
does not already end in
.Ql .gz ,
try again after appending
.Ql .gz .
Remember whether the file is compressed.
Return a file descriptor open for reading or -1 on failure.
It can be passed to
.Fn mparse_readfd
or used directly.
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_readfd
Parse a file descriptor opened with
.Xr open 2
or
.Fn mparse_open .
Pass the associated filename in
.Va fname .
This function may be called multiple times with different parameters; however,
.Xr close 2
and
.Fn mparse_reset
should be invoked between parses.
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_reset
Reset a parser so that
.Fn mparse_readfd
may be used again.
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.It Fn mparse_result
Obtain the result of a parse.
Declared in
.In mandoc.h ,
implemented in
.Pa read.c .
.El
.Ss Variables
.Bl -ohang
.It Va man_macronames
The string representation of a
.Xr man 7
macro as indexed by
.Vt "enum mant" .
.It Va mdoc_argnames
The string representation of an
.Xr mdoc 7
macro argument as indexed by
.Vt "enum mdocargt" .
.It Va mdoc_macronames
The string representation of an
.Xr mdoc 7
macro as indexed by
.Vt "enum mdoct" .
.El
.Sh IMPLEMENTATION NOTES
This section consists of structural documentation for
.Xr mdoc 7
and
.Xr man 7
syntax trees and strings.
.Ss Man and Mdoc Strings
Strings may be extracted from mdoc and man meta-data, or from text
nodes (MDOC_TEXT and MAN_TEXT, respectively).
These strings have special non-printing formatting cues embedded in the
text itself, as well as
.Xr roff 7
escapes preserved from input.
Implementing systems will need to handle both situations to produce
human-readable text.
In general, strings may be assumed to consist of 7-bit ASCII characters.
.Pp
The following non-printing characters may be embedded in text strings:
.Bl -tag -width Ds
.It Dv ASCII_NBRSP
A non-breaking space character.
.It Dv ASCII_HYPH
A soft hyphen.
.It Dv ASCII_BREAK
A breakable zero-width space.
.El
.Pp
Escape characters are also passed verbatim into text strings.
An escape character is a sequence of characters beginning with the
backslash
.Pq Sq \e .
To construct human-readable text, these should be intercepted with
.Xr mandoc_escape 3
and converted with one of the functions described in
.Xr mchars_alloc 3 .
.Ss Man Abstract Syntax Tree
This AST is governed by the ontological rules dictated in
.Xr man 7
and derives its terminology accordingly.
.Pp
The AST is composed of
.Vt struct roff_node
nodes with element, root and text types as declared by the
.Va type
field.
Each node also provides its parse point (the
.Va line ,
.Va pos ,
and
.Va sec
fields), its position in the tree (the
.Va parent ,
.Va child ,
.Va next
and
.Va prev
fields) and some type-specific data.
.Pp
The tree itself is arranged according to the following normal form,
where capitalised non-terminals represent nodes.
.Pp
.Bl -tag -width "ELEMENTXX" -compact
.It ROOT
\(<- mnode+
.It mnode
\(<- ELEMENT | TEXT | BLOCK
.It BLOCK
\(<- HEAD BODY
.It HEAD
\(<- mnode*
.It BODY
\(<- mnode*
.It ELEMENT
\(<- ELEMENT | TEXT*
.It TEXT
\(<- [[:ascii:]]*
.El
.Pp
The only elements capable of nesting other elements are those with
next-line scope as documented in
.Xr man 7 .
.Ss Mdoc Abstract Syntax Tree
This AST is governed by the ontological
rules dictated in
.Xr mdoc 7
and derives its terminology accordingly.
.Qq In-line
elements described in
.Xr mdoc 7
are described simply as
.Qq elements .
.Pp
The AST is composed of
.Vt struct roff_node
nodes with block, head, body, element, root and text types as declared
by the
.Va type
field.
Each node also provides its parse point (the
.Va line ,
.Va pos ,
and
.Va sec
fields), its position in the tree (the
.Va parent ,
.Va child ,
.Va last ,
.Va next
and
.Va prev
fields) and some type-specific data, in particular, for nodes generated
from macros, the generating macro in the
.Va tok
field.
.Pp
The tree itself is arranged according to the following normal form,
where capitalised non-terminals represent nodes.
.Pp
.Bl -tag -width "ELEMENTXX" -compact
.It ROOT
\(<- mnode+
.It mnode
\(<- BLOCK | ELEMENT | TEXT
.It BLOCK
\(<- HEAD [TEXT] (BODY [TEXT])+ [TAIL [TEXT]]
.It ELEMENT
\(<- TEXT*
.It HEAD
\(<- mnode*
.It BODY
\(<- mnode* [ENDBODY mnode*]
.It TAIL
\(<- mnode*
.It TEXT
\(<- [[:ascii:]]*
.El
.Pp
Of note are the TEXT nodes following the HEAD, BODY and TAIL nodes of
the BLOCK production: these refer to punctuation marks.
Furthermore, although a TEXT node will generally have a non-zero-length
string, in the specific case of
.Sq \&.Bd \-literal ,
an empty line will produce a zero-length string.
Multiple body parts are only found in invocations of
.Sq \&Bl \-column ,
where a new body introduces a new phrase.
.Pp
The
.Xr mdoc 7
syntax tree accommodates broken block structures as well.
The ENDBODY node is available to end the formatting associated
with a given block before the physical end of that block.
It has a non-null
.Va end
field, is of the BODY
.Va type ,
has the same
.Va tok
as the BLOCK it is ending, and has a
.Va pending
field pointing to that BLOCK's BODY node.
It is an indirect child of that BODY node
and has no children of its own.
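.Pp
A consumer walking the tree may want to recognise ENDBODY nodes.
The following C sketch is illustrative only and does not use the library:
.Vt struct sk_node ,
.Fn sk_is_endbody
and
.Fn sk_count_endbody
are hypothetical stand-ins introduced for this example.
It assumes no more than what is stated above, namely that an ENDBODY
node is of the BODY type and has a non-null
.Va end
field, and that nodes are linked through
.Va child
and
.Va next .
.Bd -literal -offset indent
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node; NOT the library's struct roff_node. */
enum sk_type { SK_BLOCK, SK_HEAD, SK_BODY, SK_TEXT };

struct sk_node {
	enum sk_type	 type;
	int		 end;	/* assumed: non-zero marks ENDBODY */
	struct sk_node	*child;	/* first child, or NULL */
	struct sk_node	*next;	/* next sibling, or NULL */
};

/* An ENDBODY node is a BODY-type node with a non-null end field. */
static int
sk_is_endbody(const struct sk_node *n)
{
	assert(n != NULL);
	return n->type == SK_BODY && n->end != 0;
}

/* Count ENDBODY nodes among n, its siblings and all descendants. */
static int
sk_count_endbody(const struct sk_node *n)
{
	int	 count = 0;

	for (; n != NULL; n = n->next) {
		if (sk_is_endbody(n))
			count++;
		count += sk_count_endbody(n->child);
	}
	return count;
}
.Ed
.Pp
A formatter built along these lines could close the formatting of the
ended block when it meets such a node instead of waiting for the
physical end of that block.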
.Pp
An ENDBODY node is generated when a block ends while one of its child
blocks is still open, as in the following example:
.Bd -literal -offset indent
\&.Ao ao
\&.Bo bo ac
\&.Ac bc
\&.Bc end
.Ed
.Pp
This example results in the following block structure:
.Bd -literal -offset indent
BLOCK Ao
    HEAD Ao
    BODY Ao
        TEXT ao
        BLOCK Bo, pending -> Ao
            HEAD Bo
            BODY Bo
                TEXT bo
                TEXT ac
                ENDBODY Ao, pending -> Ao
                TEXT bc
TEXT end
.Ed
.Pp
Here, the formatting of the
.Ic \&Ao
block extends from TEXT ao to TEXT ac,
while the formatting of the
.Ic \&Bo
block extends from TEXT bo to TEXT bc.
It renders as follows in
.Fl T Ns Cm ascii
mode:
.Pp
.Dl <ao [bo ac> bc] end
.Pp
Support for badly-nested blocks is only provided for backward
compatibility with some older
.Xr mdoc 7
implementations.
Using badly-nested blocks is
.Em strongly discouraged ;
for example, the
.Fl T Ns Cm html
front-end to
.Xr mandoc 1
is unable to render them in any meaningful way.
Furthermore, behaviour when encountering badly-nested blocks is not
consistent across troff implementations, especially when using multiple
levels of badly-nested blocks.
.Sh SEE ALSO
.Xr mandoc 1 ,
.Xr man.cgi 3 ,
.Xr mandoc_escape 3 ,
.Xr mandoc_headers 3 ,
.Xr mandoc_malloc 3 ,
.Xr mansearch 3 ,
.Xr mchars_alloc 3 ,
.Xr tbl 3 ,
.Xr eqn 7 ,
.Xr man 7 ,
.Xr mandoc_char 7 ,
.Xr mdoc 7 ,
.Xr roff 7 ,
.Xr tbl 7
.Sh AUTHORS
.An -nosplit
The
.Nm
library was written by
.An Kristaps Dzonsons Aq Mt kristaps@bsd.lv
and is maintained by
.An Ingo Schwarze Aq Mt schwarze@openbsd.org .