llvm-mca.1 - OpenGrok cross reference for /freebsd/usr.bin/clang/llvm-mca/llvm-mca.1

4 .nr rst2man-indent-level 0
7 \\$1 \\n[an-margin]
8 level \\n[rst2man-indent-level]
9 level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
10 -
11 \\n[rst2man-indent0]
12 \\n[rst2man-indent1]
13 \\n[rst2man-indent2]
18 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
19 . nr rst2man-indent-level +1
24 .\" indent \\n[an-margin]
25 .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
26 .nr rst2man-indent-level -1
27 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
30 .TH "LLVM-MCA" "1" "2023-05-24" "16" "LLVM"
32 llvm-mca \- LLVM Machine Code Analyzer
35 \fBllvm\-mca\fP [\fIoptions\fP] [input]
38 \fBllvm\-mca\fP is a performance analysis tool that uses information
50 Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions
51 Per Cycle (IPC), as well as hardware resource pressure. The analysis and
55 directly into \fBllvm\-mca\fP for analysis:
61 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-S \-o \- | llvm\-mca \-mcpu=btver2
73 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-mllvm \-x86\-asm\-syntax=intel \-S \-o \- | …
79 (\fBllvm\-mca\fP detects Intel syntax by the presence of an \fI\&.intel_syntax\fP
87 By design, the quality of the analysis conducted by \fBllvm\-mca\fP is
95 If \fBinput\fP is \(dq\fB\-\fP\(dq or omitted, \fBllvm\-mca\fP reads from standard
98 If the \fI\%\-o\fP option is omitted, then \fBllvm\-mca\fP will send its output
99 to standard output if the input is from standard input.  If the \fI\%\-o\fP
100 option specifies \(dq\fB\-\fP\(dq, then the output will also be sent to standard output.
103 .B \-help
108 .B \-o <filename>
114 .B \-mtriple=<target triple>
119 .B \-march=<arch>
125 .B \-mcpu=<cpuname>
131 .B \-output\-asm\-variant=<variant id>
139 .B \-print\-imm\-hex
145 .B \-dispatch=<width>
152 .B \-register\-file\-size=<size>
159 .B \-iterations=<number of iterations>
165 .B \-noalias=<bool>
171 .B \-lqueue=<load queue size>
179 .B \-squeue=<store queue size>
187 .B \-timeline
192 .B \-timeline\-max\-iterations=<iterations>
198 .B \-timeline\-max\-cycles=<cycles>
204 .B \-resource\-pressure
205 Enable the resource pressure view. This is enabled by default.
209 .B \-register\-file\-stats
214 .B \-dispatch\-stats
221 .B \-scheduler\-stats
227 .B \-retire\-stats
232 .B \-instruction\-info
237 .B \-show\-encoding
242 .B \-show\-barriers
248 .B \-all\-stats
255 .B \-all\-views
260 .B \-instruction\-tables
261 Prints resource pressure information based on the static information
262 available from the processor model. This differs from the resource pressure
264 the theoretical uniform distribution of resource pressure for every
269 .B \-bottleneck\-analysis
273 processors with an in\-order backend.
277 .B \-json
286 .B \-disable\-cb
294 .B \-disable\-im
302 \fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
306 \fBllvm\-mca\fP allows for the optional usage of special code comments to
308 substring \fBLLVM\-MCA\-BEGIN\fP marks the beginning of an analysis region. A
309 comment starting with substring \fBLLVM\-MCA\-END\fP marks the end of a region.
316 # LLVM\-MCA\-BEGIN
318 # LLVM\-MCA\-END
324 If no user\-defined region is specified, then \fBllvm\-mca\fP assumes a
335 # LLVM\-MCA\-BEGIN A simple example
337 # LLVM\-MCA\-END
345 in the \fBLLVM\-MCA\-END\fP directive. In the absence of overlapping regions,
346 an anonymous \fBLLVM\-MCA\-END\fP directive always ends the currently active user
355 # LLVM\-MCA\-BEGIN foo
357 # LLVM\-MCA\-BEGIN bar
359 # LLVM\-MCA\-END bar
360 # LLVM\-MCA\-END foo
372 # LLVM\-MCA\-BEGIN foo
374 # LLVM\-MCA\-BEGIN bar
376 # LLVM\-MCA\-END foo
378 # LLVM\-MCA\-END bar
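The nested and overlapping cases above can be sketched as a small marker scanner. This is an illustration only, not how \fBllvm\-mca\fP is implemented: the function name, the sample instructions, and the choice to let an anonymous END close the single active region are assumptions made for the example.

```python
def split_regions(asm_text):
    """Map each region name to the instructions seen while it was active."""
    active = {}    # region name -> instructions collected so far
    finished = {}  # region name -> instructions, once its END was seen
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comment = line[1:].strip()
            if comment.startswith("LLVM-MCA-BEGIN"):
                name = comment[len("LLVM-MCA-BEGIN"):].strip()
                if name in active:
                    raise ValueError(f"region {name!r} is already active")
                active[name] = []
            elif comment.startswith("LLVM-MCA-END"):
                name = comment[len("LLVM-MCA-END"):].strip()
                # Assumption: an anonymous END closes the only active region.
                if not name and len(active) == 1:
                    name = next(iter(active))
                if name not in active:
                    raise ValueError(f"no active region matches END {name!r}")
                finished[name] = active.pop(name)
            continue  # any other comment is not an instruction
        # An instruction belongs to every region that is currently active.
        for instructions in active.values():
            instructions.append(line)
    return finished


# Illustrative input: two overlapping regions, foo and bar.
SAMPLE = """
# LLVM-MCA-BEGIN foo
  add %eax, %edx
# LLVM-MCA-BEGIN bar
  sub %eax, %edx
# LLVM-MCA-END foo
  add %eax, %eax
# LLVM-MCA-END bar
"""

regions = split_regions(SAMPLE)
print(regions)
```

With overlapping regions, the instruction between the second BEGIN and the first END is collected into both foo and bar, which is why both END directives need a name.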
387 There is no support for marking regions from high\-level source code, like C or
395   __asm volatile(\(dq# LLVM\-MCA\-BEGIN foo\(dq:::\(dqmemory\(dq);
397   __asm volatile(\(dq# LLVM\-MCA\-END\(dq:::\(dqmemory\(dq);
417 special LLVM\-MCA comment directives.
423 # LLVM\-MCA\-<INSTRUMENT_TYPE> <data>
433 A comment starting with substring \fILLVM\-MCA\-<INSTRUMENT_TYPE>\fP
434 brings data into scope for llvm\-mca to use in its analysis for
445 Comments that are prefixed with \fILLVM\-MCA\-\fP but do not correspond to
448 that do not start with \fILLVM\-MCA\-\fP are ignored by \fBllvm\-mca\fP\&.
461 # LLVM\-MCA\-RISCV\-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
468 instructions to use the scheduling behaviour of its pseudo\-instruction
479 # LLVM\-MCA\-RISCV\-LMUL M2
493 # LLVM\-MCA\-RISCV\-LMUL M1
507 # LLVM\-MCA\-RISCV\-LMUL M1
510 # LLVM\-MCA\-RISCV\-LMUL M8
524 # LLVM\-MCA\-RISCV\-LMUL M1
527 # LLVM\-MCA\-RISCV\-LMUL M4
533 .SH HOW LLVM-MCA WORKS
535 \fBllvm\-mca\fP takes assembly code as input. The assembly code is parsed
546 dot\-product of two packed float vectors of four elements. The analysis is
549 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP:
555 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=300 dot\-product.s
591 [0]   \- JALU0
592 [1]   \- JALU1
593 [2]   \- JDiv
594 [3]   \- JFPA
595 [4]   \- JFPM
596 [5]   \- JFPU0
597 [6]   \- JFPU1
598 [7]   \- JLAGU
599 [8]   \- JMul
600 [9]   \- JSAGU
601 [10]  \- JSTC
602 [11]  \- JVALU0
603 [12]  \- JVALU1
604 [13]  \- JVIMUL
607 Resource pressure per iteration:
609 …\-      \-      \-     2.00   1.00   2.00   1.00    \-      \-      \-      \-      \-      \-    …
611 Resource pressure by instruction:
613 …\-      \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-  …
614 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
615 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
621 According to this report, the dot\-product kernel has been executed 300 times,
632 to the out\-of\-order backend every simulated cycle. For processors with an
633 in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued
645 In the absence of loop\-carried data dependencies, the observed IPC tends to a
651 field is an indicator of a performance issue. In the absence of loop\-carried
659 availability of hardware resources affects the resource pressure distribution,
668 are no loop\-carried dependencies, the observed \fIuOps Per Cycle\fP is expected to
672 resources, and the \fIResource pressure view\fP can help to identify the problematic
688 \fI\-show\-encoding\fP is specified.
690 Below is an example of \fI\-show\-encoding\fP output for the dot\-product kernel:
718 The third section is the \fIResource pressure view\fP\&.  This view reports
725 (JFPU1 \- floating point pipeline #1), consuming an average of 1 resource cycle
726 per iteration.  Note that on AMD Jaguar, vector floating\-point multiply can
727 only be issued to pipeline JFPU1, while horizontal floating\-point additions can
730 The resource pressure view helps with identifying bottlenecks caused by high
731 usage of specific hardware resources.  Situations with resource pressure mainly
733 pressure should be uniformly distributed between multiple resources.
738 command line option \fB\-timeline\fP\&.  As instructions transition through the
753 \- : Instruction executed, waiting to be retired.
756 Below is the timeline view for a subset of the dot\-product example located in
757 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP and processed by
758 \fBllvm\-mca\fP using the following command:
764 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=3 \-timeline dot\-product…
781 [1,0]     .DeeE\-\-\-\-\-R    .   vmulps   %xmm0, %xmm1, %xmm2
782 [1,1]     . D=eeeE\-\-\-R   .   vhaddps  %xmm2, %xmm2, %xmm3
784 [2,0]     .  DeeE\-\-\-\-\-R  .   vmulps   %xmm0, %xmm1, %xmm2
813 sub\-optimal usage of hardware resources.
818 example was generated using 3 iterations: \fB\-iterations=3\fP, the iteration
819 indices range from 0 to 2 inclusive.
843 There is a gap of 5 cycles between the write\-back stage and the retire event.
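As a rough check of such readings, the per-state cycle counts can be recovered by counting marker characters in a timeline row. The sketch below is not part of \fBllvm\-mca\fP; the function name and dictionary keys are made up for the example, and the rows are copied from the timeline above with the troff escapes removed. It relies on the timeline legend: \(dq=\(dq marks a cycle spent waiting in the scheduler, \(dqe\(dq and \(dqE\(dq mark execution, and \(dq\-\(dq marks a cycle spent waiting to retire.

```python
def timeline_counts(bar):
    """Count per-state cycles in one timeline row such as 'DeeE-----R'."""
    return {
        "waiting_in_scheduler": bar.count("="),
        "execute_cycles": bar.count("e") + bar.count("E"),
        "waiting_to_retire": bar.count("-"),
    }


# Rows copied from the timeline view; '.' and spaces are column padding.
rows = {
    "[1,0]": ".DeeE-----R    .",
    "[1,1]": ". D=eeeE---R   .",
    "[2,0]": ".  DeeE-----R  .",
}

for index, bar in rows.items():
    print(index, timeline_counts(bar))
```

For row [1,0] the count of \(dq\-\(dq characters is 5, which matches the 5-cycle gap between write\-back and retire.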
853 In the dot\-product example, there are anti\-dependencies introduced by
861 instructions measured. Note that \fBllvm\-mca\fP, by default, assumes at
871 cycles spent in the queue tends to be larger (i.e., more than 1\-3cy),
875 The \fB\-bottleneck\-analysis\fP command line option enables the analysis of
879 backend pressure (caused by pipeline resource pressure and data dependencies) to
882 Below is an example of \fB\-bottleneck\-analysis\fP output generated by
883 \fBllvm\-mca\fP for 500 iterations of the dot\-product example on btver2.
889 Cycles with backend pressure increase [ 48.07% ]
891   Resource Pressure       [ 47.77% ]
892   \- JFPA  [ 47.77% ]
893   \- JFPU0  [ 47.77% ]
895   \- Register Dependencies [ 0.30% ]
896   \- Memory Dependencies   [ 0.00% ]
901  +\-\-\-\-< 2.    vhaddps %xmm3, %xmm3, %xmm4
906 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
907  +\-\-\-\-> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
911 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
917 According to the analysis, throughput is limited by resource pressure and not by
918 data dependencies.  The analysis observed increases in backend pressure during
919 48.07% of the simulated run. Almost all those pressure increase events were
931 Bottleneck analysis is currently not supported for processors with an in\-order
935 The \fB\-all\-stats\fP command line option enables extra statistics and performance
939 Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP
940 for 300 iterations of the dot\-product example discussed in the previous
948 RAT     \- Register unavailable:                      0
949 RCU     \- Retire tokens unavailable:                 0
950 SCHEDQ  \- Scheduler full:                            272  (44.6%)
951 LQ      \- Load queue full:                           0
952 SQ      \- Store queue full:                          0
953 GROUP   \- Static restrictions on the dispatch group: 0
956 Dispatch Logic \- number of cycles where we saw N micro opcodes dispatched:
963 Schedulers \- number of cycles where we saw N micro opcodes issued:
981 Retire Control Unit \- number of cycles where we saw N instructions retired:
996 *  Register File #1 \-\- JFpuPRF:
999    Max number of mappings used:      35
1001 *  Register File #2 \-\- JIntegerPRF:
1004    Max number of mappings used:      0
1018 \fB\-all\-stats\fP or \fB\-dispatch\-stats\fP\&.
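The percentage printed next to a stall counter is a share of the simulated cycles, so a counter and its rounded percentage bound the total cycle count. The following sketch works that arithmetic for the SCHEDQ line (272 cycles, 44.6%). The helper name is hypothetical, and the code assumes the percentage is rounded to one decimal place.

```python
import math


def cycles_from_share(count, percent, decimals=1):
    """List the integer cycle totals consistent with a rounded percentage."""
    half_step = 0.5 * 10 ** (-decimals)
    # The true share lies within half a rounding step of the printed value.
    lowest = count / ((percent + half_step) / 100.0)
    highest = count / ((percent - half_step) / 100.0)
    return list(range(math.ceil(lowest), math.floor(highest) + 1))


# SCHEDQ - Scheduler full: 272 (44.6%)
candidates = cycles_from_share(272, 44.6)
print(candidates)
```

Only one integer total fits both numbers, so the report implies a run of about 610 simulated cycles for the 300 iterations.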
1031 JALU01 \- A scheduler for ALU instructions.
1033 JFPU01 \- A scheduler for floating point operations.
1035 JLSAGU \- A scheduler for address generation.
1038 The dot\-product is a kernel of three floating point instructions (a vector
1043 sub\-optimal usage of hardware resources.  Sometimes, resource pressure can be
1047 statistics are displayed by using the command option \fB\-all\-stats\fP or
1048 \fB\-scheduler\-stats\fP\&.
1055 \fB\-all\-stats\fP or \fB\-retire\-stats\fP\&.
1059 Jaguar, there are two register files, one for floating\-point registers (JFpuPRF)
1061 instructions processed, there were 900 mappings created.  Since this dot\-product
1067 displayed by using the command option \fB\-all\-stats\fP or
1068 \fB\-register\-file\-stats\fP\&.
1071 dependencies, and not by resource pressure.
1075 \fBllvm\-mca\fP, as well as the functional units involved in the process.
1090 The in\-order pipeline implements the following sequence of stages:
1094 \fBllvm\-mca\fP assumes that instructions have all been decoded and placed
1097 diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction.
1121 the processor. \fBllvm\-mca\fP uses that information to initialize register
1124 \fB\-register\-file\-size\fP\&.  A value of zero for this option means \fIunbounded\fP\&. By
1129 number of micro\-opcodes specified for that instruction by the target scheduling
1131 instructions that are \(dqin\-flight\(dq, and retiring them in program order.  The
1136 entries. \fBllvm\-mca\fP queries the scheduling model to determine the set
1144 execution and may be issued (potentially out\-of\-order) for execution.
1145 Instruction latencies are computed by \fBllvm\-mca\fP with the help of the
1148 \fBllvm\-mca\fP\(aqs scheduler is designed to simulate multiple processor
1156 round\-robin selector to guarantee that resource usage is uniformly distributed
1159 \fBllvm\-mca\fP\(aqs scheduler internally groups instructions into three sets:
1176 .SS Write\-Back and Retire Stage
1179 instructions wait until they reach the write\-back stage.  At that point, they
1190 To simulate an out\-of\-order execution of memory operations, \fBllvm\-mca\fP
1195 specify flags \fB\-lqueue\fP and \fB\-squeue\fP to limit the number of entries in the
1214 (\fI\-noalias=true\fP) store operations.  Under this assumption, younger loads are
1219 Note that, in the case of write\-combining memory, rule 3 could be relaxed to
1220 allow reordering of non\-aliasing store operations.  That being said, at the
1221 moment, there is no way to further relax the memory model (\fB\-noalias\fP is the
1223 type (e.g., write\-back, write\-combining, write\-through, etc.) and consequently
1229 The LSUnit does not know when store\-to\-load forwarding may occur.
1239 loads, the scheduling model provides an \(dqoptimistic\(dq load\-to\-use latency (which
1240 usually matches the load\-to\-use latency for when there is a hit in the L1D).
1242 \fBllvm\-mca\fP does not (on its own) know about serializing operations or
1243 memory\-barrier like instructions.  The LSUnit used to conservatively use an
1245 determine whether an instruction should be treated as a memory\-barrier. This was
1265 A store may not pass a previous load (regardless of \fB\-noalias\fP).
1271 A load may not pass a previous store unless \fB\-noalias\fP is set.
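The load/store ordering rules can be read as a predicate over a younger and an older memory operation. The sketch below is a simplified illustration, not the LSUnit implementation: the function name is hypothetical, barriers are ignored, and the store\-after\-store and load\-after\-load cases are taken from the full rule summary in the upstream documentation rather than from the lines shown here.

```python
def may_pass(younger, older, noalias):
    """Return True if `younger` may execute before the older operation.

    `younger` and `older` are "load" or "store"; `noalias` mirrors the
    value of the -noalias option. Memory barriers are not modelled.
    """
    kinds = ("load", "store")
    if younger not in kinds or older not in kinds:
        raise ValueError("operations must be 'load' or 'store'")
    if younger == "store":
        # A store may not pass a previous load (regardless of -noalias),
        # and (per the upstream summary) may not pass a previous store.
        return False
    if older == "load":
        # Per the upstream summary, a load may pass a previous load.
        return True
    # A load may not pass a previous store unless -noalias is set.
    return noalias


for flag in (True, False):
    print(flag, may_pass("load", "store", noalias=flag))
```

Only the load\-after\-store case depends on \fB\-noalias\fP; the other three combinations give the same answer for either setting.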
1275 .SS In\-order Issue and Execute
1277 In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It
1284 retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However,
1285 an instruction is allowed to commit writes and retire out\-of\-order if
1290 scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them
1296 hazards that \fBllvm\-mca\fP has no way of knowing about).
1298 \fBllvm\-mca\fP comes with one generic and multiple target specific
1299 CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP
1302 class is only a part of the in\-order pipeline, but there are plans to add it
1303 to the out\-of\-order pipeline in the future.
1335 LMUL to the scheduling class of the pseudo\-instruction that describes
1339 \fBllvm\-mca\fP comes with several Views such as the Timeline View and
1341 targets. If you wish to add a new View to \fBllvm\-mca\fP and it does not
1344 \fI/tools/llvm\-mca/View/\fP directory. However, if your new View is target specific
1358 the \fI\-disable\-cb\fP flag is used.
1360 Enabling these custom Views does not affect the non\-custom (generic) Views.
1366 2003-2023, LLVM Project