<!--
                          __  __            _
                       ___\ \/ /_ __   __ _| |_
                      / _ \\  /| '_ \ / _` | __|
                     |  __//  \| |_) | (_| | |_
                      \___/_/\_\ .__/ \__,_|\__|
                                |_| XML parser

   Copyright (c) 2001      Scott Bronson <bronson@rinspin.com>
   Copyright (c) 2002-2003 Fred L. Drake, Jr. <fdrake@users.sourceforge.net>
   Copyright (c) 2009      Karl Waclawek <karl@waclawek.net>
   Copyright (c) 2016-2026 Sebastian Pipping <sebastian@pipping.org>
   Copyright (c) 2016      Ardo van Rangelrooij <ardo@debian.org>
   Copyright (c) 2017      Rhodri James <rhodri@wildebeest.org.uk>
   Copyright (c) 2020      Joe Orton <jorton@redhat.com>
   Copyright (c) 2021      Tim Bray <tbray@textuality.com>
   Unlike most of Expat,
   this file is copyrighted under the GNU Free Documentation License 1.1.
-->
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
          "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [
  <!ENTITY dhfirstname "<firstname>Scott</firstname>">
  <!ENTITY dhsurname   "<surname>Bronson</surname>">
  <!ENTITY dhdate      "<date>May 10, 2026</date>">
  <!-- Please adjust this^^ date whenever cutting a new release.
-->
  <!ENTITY dhsection   "<manvolnum>1</manvolnum>">
  <!ENTITY dhemail     "<email>bronson@rinspin.com</email>">
  <!ENTITY dhusername  "Scott Bronson">
  <!ENTITY dhucpackage "<refentrytitle>XMLWF</refentrytitle>">
  <!ENTITY dhpackage   "xmlwf">

  <!ENTITY gnu         "<acronym>GNU</acronym>">
  <!ENTITY debian      "<productname>Debian &gnu;/Linux</productname>">
]>

<refentry>
  <refentryinfo>
    <address>
      &dhemail;
    </address>
    <author>
      &dhfirstname;
      &dhsurname;
    </author>
    <copyright>
      <year>2001</year>
      <holder>&dhusername;</holder>
    </copyright>
    &dhdate;
  </refentryinfo>
  <refmeta>
    &dhucpackage;

    &dhsection;
  </refmeta>
  <refnamediv>
    <refname>&dhpackage;</refname>

    <refpurpose>Determines if an XML document is well-formed</refpurpose>
  </refnamediv>
  <refsynopsisdiv>
    <cmdsynopsis>
      <command>&dhpackage;</command>
      <arg><replaceable>OPTIONS</replaceable></arg>
      <arg><replaceable>FILE</replaceable> ...</arg>
    </cmdsynopsis>
    <cmdsynopsis>
      <command>&dhpackage;</command>
      <group choice="plain">
        <arg><option>-h</option></arg>
        <arg><option>--help</option></arg>
      </group>
    </cmdsynopsis>
    <cmdsynopsis>
      <command>&dhpackage;</command>
      <group choice="plain">
        <arg><option>-v</option></arg>
        <arg><option>--version</option></arg>
      </group>
    </cmdsynopsis>
  </refsynopsisdiv>

  <refsect1>
    <title>DESCRIPTION</title>

    <para>
      <command>&dhpackage;</command> uses the Expat library to
      determine if an XML document is well-formed.  It is
      non-validating.
    </para>
    <para>
      If you do not specify any files on the command-line, and you
      have a recent version of <command>&dhpackage;</command>, the
      input file will be read from standard input.
    </para>

  </refsect1>

  <refsect1>
    <title>WELL-FORMED DOCUMENTS</title>
    <para>
      A well-formed document must adhere to the
      following rules:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          The file begins with an XML declaration.  For instance,
          <literal>&lt;?xml version="1.0" standalone="yes"?&gt;</literal>.
          <emphasis>NOTE</emphasis>:
          <command>&dhpackage;</command> does not currently
          check for a valid XML declaration.
        </para>
      </listitem>
      <listitem>
        <para>
          Every start tag is either empty (&lt;tag/&gt;)
          or has a corresponding end tag.
        </para>
      </listitem>
      <listitem>
        <para>
          There is exactly one root element.  This element must contain
          all other elements in the document.  Only comments, white
          space, and processing instructions may come after the close
          of the root element.
        </para>
      </listitem>
      <listitem>
        <para>
          All elements nest properly.
        </para>
      </listitem>
      <listitem>
        <para>
          All attribute values are enclosed in quotes (either single
          or double).
        </para>
      </listitem>
    </itemizedlist>
    <para>
      If the document has a DTD, and it strictly complies with that
      DTD, then the document is also considered <emphasis>valid</emphasis>.
      <command>&dhpackage;</command> is a non-validating parser --
      it does not check the DTD.  However, it does support
      external entities (see the <option>-x</option> option).
    </para>
  </refsect1>

  <refsect1>
    <title>OPTIONS</title>
    <para>
      When an option includes an argument, you may specify the argument either
      separately ("<option>-d</option> <replaceable>output</replaceable>") or concatenated with the
      option ("<option>-d</option><replaceable>output</replaceable>").  <command>&dhpackage;</command>
      supports both.
    </para>
    <variablelist>

      <varlistentry>
        <term><option>-a</option> <replaceable>factor</replaceable></term>
        <listitem>
          <para>
            Sets the maximum tolerated amplification factor
            for protection against amplification attacks
            like the billion laughs attack
            (default: 100.0
            for the sum of direct and indirect output and also
            for allocations of dynamic memory).
            The amplification factor is calculated as ..
          </para>
          <literallayout>
            amplification := (direct + indirect) / direct
          </literallayout>
          <para>
            .. with regard to use of entities and ..
          </para>
          <literallayout>
            amplification := allocated / direct
          </literallayout>
          <para>
            .. with regard to dynamic memory while parsing.
            &lt;direct&gt; is the number of bytes read
              from the primary document in parsing,
            &lt;indirect&gt; is the number of bytes
              added by expanding entities and reading of external DTD files,
              combined, and
            &lt;allocated&gt; is the total number of bytes of dynamic memory
              allocated (and not freed) per hierarchy of parsers.
          </para>
          <para>
            <emphasis>NOTE</emphasis>:
            If you ever need to increase this value for non-attack payload,
            please file a bug report.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-b</option> <replaceable>bytes</replaceable></term>
        <listitem>
          <para>
            Sets the number of output bytes (including amplification)
            needed to activate protection against amplification attacks
            like billion laughs
            (default: 8 MiB for the sum of direct and indirect output,
            and 64 MiB for allocations of dynamic memory).
            This can be thought of as an "activation threshold".
          </para>
          <para>
            <emphasis>NOTE</emphasis>:
            If you ever need to increase this value for non-attack payload,
            please file a bug report.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-c</option></term>
        <listitem>
          <para>
            If the input file is well-formed and <command>&dhpackage;</command>
            doesn't encounter any errors, the input file is simply copied to
            the output directory unchanged.
            This implies no namespaces (turns off <option>-n</option>) and
            requires <option>-d</option> to specify an output directory.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-d</option> <replaceable>output-dir</replaceable></term>
        <listitem>
          <para>
            Specifies a directory to contain transformed
            representations of the input files.
            By default, <option>-d</option> outputs a canonical representation
            (described below).
            You can select different output formats using <option>-c</option>,
            <option>-m</option> and <option>-N</option>.
          </para>
          <para>
            The output filenames will
            be exactly the same as the input filenames or "STDIN" if the input is
            coming from standard input.  Therefore, you must be careful that the
            output file does not go into the same directory as the input
            file.  Otherwise, <command>&dhpackage;</command> will delete the
            input file before it generates the output file (just like running
            <literal>cat &lt; file &gt; file</literal> in most shells).
          </para>
          <para>
            Two structurally equivalent XML documents have a byte-for-byte
            identical canonical XML representation.
            Note that ignorable white space is considered significant and
            is treated equivalently to data.
            More on canonical XML can be found at
            http://www.jclark.com/xml/canonxml.html .
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-e</option> <replaceable>encoding</replaceable></term>
        <listitem>
          <para>
            Specifies the character encoding for the document, overriding
            any document encoding declaration.
            <command>&dhpackage;</command>
            supports four built-in encodings:
            <literal>US-ASCII</literal>,
            <literal>UTF-8</literal>,
            <literal>UTF-16</literal>, and
            <literal>ISO-8859-1</literal>.
            Also see the <option>-w</option> option.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-g</option> <replaceable>bytes</replaceable></term>
        <listitem>
          <para>
            Sets the buffer size to request per call pair to
            <function>XML_GetBuffer</function> and <function>read</function>
            (default: 8 KiB).
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-h</option></term>
        <term><option>--help</option></term>
        <listitem>
          <para>
            Prints short usage information for <command>&dhpackage;</command>
            and then exits.
            Similar to this man page but more concise.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-k</option></term>
        <listitem>
          <para>
            When processing multiple files, <command>&dhpackage;</command>
            by default halts after the first file with an error.
            This tells <command>&dhpackage;</command> to report the error
            but to keep processing.
            This can be useful, for example, when testing a filter that converts
            many files to XML and you want to quickly find out which conversions
            failed.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-m</option></term>
        <listitem>
          <para>
            Outputs a meta XML file, rather than canonical XML, that completely
            describes the input file, including character positions.
            Requires <option>-d</option> to specify an output directory.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-n</option></term>
        <listitem>
          <para>
            Turns on namespace processing, so that Expat resolves namespace
            prefixes in element and attribute names against their declarations.
            <option>-c</option> disables namespaces.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-N</option></term>
        <listitem>
          <para>
            Adds a doctype and notation declarations to canonical XML output.
            This matches the example output used by the formal XML test cases.
            Requires <option>-d</option> to specify an output directory.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-p</option></term>
        <listitem>
          <para>
            Tells <command>&dhpackage;</command> to process external DTDs and parameter
            entities.
          </para>
          <para>
            Normally <command>&dhpackage;</command> never parses parameter
            entities.  <option>-p</option> tells it to always parse them.
            <option>-p</option> implies <option>-x</option>.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-q</option></term>
        <listitem>
          <para>
            Disable reparse deferral, and allow quadratic parse runtime
            on large tokens (default: reparse deferral enabled).
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-r</option></term>
        <listitem>
          <para>
            Normally <command>&dhpackage;</command> memory-maps the XML file
            before parsing; this can result in faster parsing on many
            platforms.
            <option>-r</option> turns off memory-mapping and uses normal file
            IO calls instead.
            Of course, memory-mapping is automatically turned off
            when reading from standard input.
          </para>
          <para>
            Use of memory-mapping can cause some platforms to report
            substantially higher memory usage for
            <command>&dhpackage;</command>, but this appears to be a matter of
            the operating system reporting memory in a strange way; there is
            not a leak in <command>&dhpackage;</command>.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-s</option></term>
        <listitem>
          <para>
            Prints an error if the document is not standalone.
            A document is standalone if it has no external subset and no
            references to parameter entities.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-t</option></term>
        <listitem>
          <para>
            Turns on timings.  This tells Expat to parse the entire file,
            but not perform any processing.
            This gives a fairly accurate idea of the raw speed of Expat itself
            without client overhead.
            <option>-t</option> turns off most of the output options
            (<option>-d</option>, <option>-m</option>, <option>-c</option>, ...).
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-v</option></term>
        <term><option>--version</option></term>
        <listitem>
          <para>
            Prints the version of the Expat library being used, including some
            information on the compile-time configuration of the library, and
            then exits.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-w</option></term>
        <listitem>
          <para>
            Enables support for Windows code pages.
            Normally, <command>&dhpackage;</command> will throw an error if it
            runs across an encoding that it is not equipped to handle itself.  With
            <option>-w</option>, <command>&dhpackage;</command> will try to use a Windows code
            page.  See also <option>-e</option>.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-x</option></term>
        <listitem>
          <para>
            Turns on parsing external entities.
            (CAREFUL! This makes xmlwf vulnerable to external entity attacks (XXE).)
          </para>
          <para>
            Non-validating parsers are not required to resolve external
            entities, or even expand entities at all.
            Expat always expands internal entities (?),
            but external entity parsing must be enabled explicitly.
          </para>
          <para>
            External entities are simply entities that obtain their
            data from outside the XML file currently being parsed.
          </para>
          <para>
            This is an example of an internal entity:
            <literallayout>
&lt;!ENTITY vers '1.0.2'&gt;
            </literallayout>
          </para>
          <para>
            And here are some examples of external entities:

            <literallayout>
&lt;!ENTITY header SYSTEM "header-&amp;vers;.xml"&gt;  (parsed)
&lt;!ENTITY logo SYSTEM "logo.png" NDATA PNG&gt;   (unparsed)
            </literallayout>
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>--</option></term>
        <listitem>
          <para>
            (Two hyphens.)
            Terminates the list of options.  This is only needed if a filename
            starts with a hyphen.  For example:
          </para>
          <literallayout>
&dhpackage; -- -myfile.xml
          </literallayout>
          <para>
            will run <command>&dhpackage;</command> on the file
            <filename>-myfile.xml</filename>.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
    <para>
      Older versions of <command>&dhpackage;</command> do not support
      reading from standard input.
    </para>
  </refsect1>

  <refsect1>
    <title>OUTPUT</title>
    <para><command>&dhpackage;</command> outputs nothing for files which are problem-free.
      If any input file is not well-formed, or if the output for any
      input file cannot be opened, <command>&dhpackage;</command> prints a single
      line describing the problem to standard output.
    </para>
    <para>
      If the <option>-k</option> option is not provided, <command>&dhpackage;</command>
      halts upon encountering a well-formedness or output-file error.
      If <option>-k</option> is provided, <command>&dhpackage;</command> continues
      processing the remaining input files, describing problems found with any of them.
    </para>
  </refsect1>

  <refsect1>
    <title>EXIT STATUS</title>
    <para>For options <option>-v</option>|<option>--version</option> or <option>-h</option>|<option>--help</option>, <command>&dhpackage;</command> always exits with status code 0.
      For other cases, the following exit status codes are returned:
      <variablelist>
        <varlistentry>
          <term><option>0</option></term>
          <listitem><para>The input files are well-formed and the output (if requested) was written successfully.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><option>1</option></term>
          <listitem><para>An internal error occurred.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><option>2</option></term>
          <listitem><para>One or more input files were not well-formed or could not be parsed.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><option>3</option></term>
          <listitem><para>If using the <option>-d</option> option, an error occurred opening an output file.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><option>4</option></term>
          <listitem><para>There was a command-line argument error in how <command>&dhpackage;</command> was invoked.</para>
          </listitem>
        </varlistentry>
      </variablelist>
    </para>
  </refsect1>


  <refsect1>
    <title>BUGS</title>
    <para>
      The errors should go to standard error, not standard output.
    </para>
    <para>
      There should be a way to get <option>-d</option> to send its
      output to standard output rather than forcing the user to send
      it to a file.
    </para>
    <para>
      I have no idea why anyone would want to use the
      <option>-d</option>, <option>-c</option>, and
      <option>-m</option> options.  If someone could explain it to
      me, I'd like to add this information to this manpage.
    </para>
  </refsect1>

  <refsect1>
    <title>SEE ALSO</title>
    <para>
      <literallayout>
The Expat home page:                             https://libexpat.github.io/
The W3C XML 1.0 specification (fourth edition):  https://www.w3.org/TR/2006/REC-xml-20060816/
Billion laughs attack:                           https://en.wikipedia.org/wiki/Billion_laughs_attack
      </literallayout>
    </para>
  </refsect1>

  <refsect1>
    <title>AUTHOR</title>
    <para>
      This manual page was originally written by &dhusername; &dhemail;
      in December 2001 for
      the &debian; system (but may be used by others).  Permission is
      granted to copy, distribute and/or modify this document under
      the terms of the &gnu; Free Documentation
      License, Version 1.1.
    </para>
  </refsect1>
</refentry>