llvm-mca.1 - OpenGrok cross reference for /freebsd/usr.bin/clang/llvm-mca/llvm-mca.1

Lines Matching +full:cycle +full:- +full:6
4 .nr rst2man-indent-level 0
7 \\$1 \\n[an-margin]
8 level \\n[rst2man-indent-level]
9 level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
10 -
11 \\n[rst2man-indent0]
12 \\n[rst2man-indent1]
13 \\n[rst2man-indent2]
18 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
19 . nr rst2man-indent-level +1
24 .\" indent \\n[an-margin]
25 .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
26 .nr rst2man-indent-level -1
27 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
30 .TH "LLVM-MCA" "1" "2023-05-24" "16" "LLVM"
32 llvm-mca \- LLVM Machine Code Analyzer
35 \fBllvm\-mca\fP [\fIoptions\fP] [input]
38 \fBllvm\-mca\fP is a performance analysis tool that uses information
50 Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions
51 Per Cycle (IPC), as well as hardware resource pressure. The analysis and
55 directly into \fBllvm\-mca\fP for analysis:
61 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-S \-o \- | llvm\-mca \-mcpu=btver2
73 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-mllvm \-x86\-asm\-syntax=intel \-S \-o \- | …
79 (\fBllvm\-mca\fP detects Intel syntax by the presence of an \fI\&.intel_syntax\fP
87 By design, the quality of the analysis conducted by \fBllvm\-mca\fP is
95 If \fBinput\fP is \(dq\fB\-\fP\(dq or omitted, \fBllvm\-mca\fP reads from standard
98 If the \fI\%\-o\fP option is omitted, then \fBllvm\-mca\fP will send its output
99 to standard output if the input is from standard input.  If the \fI\%\-o\fP
100 option specifies \(dq\fB\-\fP\(dq, then the output will also be sent to standard output.
103 .B \-help
108 .B \-o <filename>
114 .B \-mtriple=<target triple>
119 .B \-march=<arch>
125 .B \-mcpu=<cpuname>
131 .B \-output\-asm\-variant=<variant id>
139 .B \-print\-imm\-hex
145 .B \-dispatch=<width>
152 .B \-register\-file\-size=<size>
159 .B \-iterations=<number of iterations>
165 .B \-noalias=<bool>
171 .B \-lqueue=<load queue size>
179 .B \-squeue=<store queue size>
187 .B \-timeline
192 .B \-timeline\-max\-iterations=<iterations>
198 .B \-timeline\-max\-cycles=<cycles>
204 .B \-resource\-pressure
209 .B \-register\-file\-stats
214 .B \-dispatch\-stats
221 .B \-scheduler\-stats
227 .B \-retire\-stats
232 .B \-instruction\-info
237 .B \-show\-encoding
242 .B \-show\-barriers
248 .B \-all\-stats
255 .B \-all\-views
260 .B \-instruction\-tables
269 .B \-bottleneck\-analysis
273 processors with an in\-order backend.
277 .B \-json
286 .B \-disable\-cb
294 .B \-disable\-im
302 \fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
306 \fBllvm\-mca\fP allows for the optional usage of special code comments to
308 substring \fBLLVM\-MCA\-BEGIN\fP marks the beginning of an analysis region. A
309 comment starting with substring \fBLLVM\-MCA\-END\fP marks the end of a region.
316 # LLVM\-MCA\-BEGIN
318 # LLVM\-MCA\-END
324 If no user\-defined region is specified, then \fBllvm\-mca\fP assumes a
335 # LLVM\-MCA\-BEGIN A simple example
337 # LLVM\-MCA\-END
345 in the \fBLLVM\-MCA\-END\fP directive. In the absence of overlapping regions,
346 an anonymous \fBLLVM\-MCA\-END\fP directive always ends the currently active user
355 # LLVM\-MCA\-BEGIN foo
357 # LLVM\-MCA\-BEGIN bar
359 # LLVM\-MCA\-END bar
360 # LLVM\-MCA\-END foo
372 # LLVM\-MCA\-BEGIN foo
374 # LLVM\-MCA\-BEGIN bar
376 # LLVM\-MCA\-END foo
378 # LLVM\-MCA\-END bar
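The region directives above can be paired up mechanically. The following is a minimal Python sketch, not llvm-mca's own parser: the function name find_regions is hypothetical, the text after the directive is treated as the region name, and an anonymous LLVM-MCA-END is assumed to close the most recently opened region that is still active. The nop lines are placeholders for real instructions.

```python
import re

# Matches "# LLVM-MCA-BEGIN [name]" and "# LLVM-MCA-END [name]" comments.
# Treating everything after the directive as the region name is an
# assumption of this sketch.
_DIRECTIVE = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b(.*)")


def find_regions(lines):
    """Return (name, begin_line, end_line) for every closed region."""
    active = []  # [name, begin_line] pairs, in opening order
    closed = []
    for number, line in enumerate(lines, start=1):
        match = _DIRECTIVE.search(line)
        if match is None:
            continue  # ordinary instruction or unrelated comment
        kind, name = match.group(1), match.group(2).strip()
        if kind == "BEGIN":
            active.append([name, number])
            continue
        if not active:
            raise ValueError(f"line {number}: END without an active region")
        if name:
            # A named END closes the region with that name, even if other
            # regions were opened after it (overlapping regions).
            candidates = [i for i, r in enumerate(active) if r[0] == name]
            if not candidates:
                raise ValueError(f"line {number}: no active region {name!r}")
            index = candidates[-1]
        else:
            # Assumption: an anonymous END closes the newest active region.
            index = len(active) - 1
        begin_name, begin_line = active.pop(index)
        closed.append((begin_name, begin_line, number))
    return closed


overlapping = [
    "# LLVM-MCA-BEGIN foo",
    "  nop",
    "# LLVM-MCA-BEGIN bar",
    "  nop",
    "# LLVM-MCA-END foo",
    "  nop",
    "# LLVM-MCA-END bar",
]
print(find_regions(overlapping))
```

Regions are reported in the order in which they are closed, so in the overlapping case foo is listed before bar.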
387 There is no support for marking regions from high\-level source code, like C or
395   __asm volatile(\(dq# LLVM\-MCA\-BEGIN foo\(dq:::\(dqmemory\(dq);
397   __asm volatile(\(dq# LLVM\-MCA\-END\(dq:::\(dqmemory\(dq);
417 special LLVM\-MCA comment directives.
423 # LLVM\-MCA\-<INSTRUMENT_TYPE> <data>
433 A comment starting with substring \fILLVM\-MCA\-<INSTRUMENT_TYPE>\fP
434 brings data into scope for llvm\-mca to use in its analysis for
445 Comments that are prefixed with \fILLVM\-MCA\-\fP but do not correspond to
448 that do not start with \fILLVM\-MCA\-\fP are ignored by \fBllvm\-mca\fP\&.
461 # LLVM\-MCA\-RISCV\-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
468 instructions to use the scheduling behaviour of its pseudo\-instruction
479 # LLVM\-MCA\-RISCV\-LMUL M2
493 # LLVM\-MCA\-RISCV\-LMUL M1
507 # LLVM\-MCA\-RISCV\-LMUL M1
510 # LLVM\-MCA\-RISCV\-LMUL M8
524 # LLVM\-MCA\-RISCV\-LMUL M1
527 # LLVM\-MCA\-RISCV\-LMUL M4
533 .SH HOW LLVM-MCA WORKS
535 \fBllvm\-mca\fP takes assembly code as input. The assembly code is parsed
546 dot\-product of two packed float vectors of four elements. The analysis is
549 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP:
555 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=300 dot\-product.s
571 uOps Per Cycle:    1.48
582 [6]: HasSideEffects (U)
584 [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
591 [0]   \- JALU0
592 [1]   \- JALU1
593 [2]   \- JDiv
594 [3]   \- JFPA
595 [4]   \- JFPM
596 [5]   \- JFPU0
597 [6]   \- JFPU1
598 [7]   \- JLAGU
599 [8]   \- JMul
600 [9]   \- JSAGU
601 [10]  \- JSTC
602 [11]  \- JVALU0
603 [12]  \- JVALU1
604 [13]  \- JVIMUL
608 [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
609 …\-      \-      \-     2.00   1.00   2.00   1.00    \-      \-      \-      \-      \-      \-    …
612 [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   I…
613 …\-      \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-  …
614 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
615 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
621 According to this report, the dot\-product kernel has been executed 300 times,
628 \fBIPC\fP, \fBuOps Per Cycle\fP, and \fBBlock RThroughput\fP (Block Reciprocal
632 to the out\-of\-order backend every simulated cycle. For processors with an
633 in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued
634 to the backend every simulated cycle.
641 (i.e. iterations) that can be executed per simulated clock cycle in the absence
645 In the absence of loop\-carried data dependencies, the observed IPC tends to a
649 Field \(aquOps Per Cycle\(aq is computed by dividing the total number of simulated micro
651 field is an indicator of a performance issue. In the absence of loop\-carried
652 data dependencies, the observed \(aquOps Per Cycle\(aq should tend to a theoretical
656 Field \fIuOps Per Cycle\fP is bounded from above by the dispatch width. That is
658 and \(aquOps Per Cycle\(aq are limited by the amount of hardware parallelism. The
661 cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
662 Cycle (computed by dividing the number of uOps of a single iteration by the
668 are no loop\-carried dependencies, the observed \fIuOps Per Cycle\fP is expected to
682 executed per clock cycle in the absence of operand dependencies. In this
688 \fI\-show\-encoding\fP is specified.
690 Below is an example of \fI\-show\-encoding\fP output for the dot\-product kernel:
702 [6]: HasSideEffects (U)
705 [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
724 of the instruction vmulps always executes on resource unit [6]
725 (JFPU1 \- floating point pipeline #1), consuming an average of 1 resource cycle
726 per iteration.  Note that on AMD Jaguar, vector floating\-point multiply can
727 only be issued to pipeline JFPU1, while horizontal floating\-point additions can
738 command line option \fB\-timeline\fP\&.  As instructions transition through the
753 \- : Instruction executed, waiting to be retired.
756 Below is the timeline view for a subset of the dot\-product example located in
757 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP and processed by
758 \fBllvm\-mca\fP using the following command:
764 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=3 \-timeline dot\-product…
781 [1,0]     .DeeE\-\-\-\-\-R    .   vmulps   %xmm0, %xmm1, %xmm2
782 [1,1]     . D=eeeE\-\-\-R   .   vhaddps  %xmm2, %xmm2, %xmm3
784 [2,0]     .  DeeE\-\-\-\-\-R  .   vmulps   %xmm0, %xmm1, %xmm2
813 sub\-optimal usage of hardware resources.
818 example was generated using 3 iterations (\fB\-iterations=3\fP), the iteration
819 indices range from 0 to 2 inclusive.
827 Instruction [1,0] was dispatched at cycle 1.
829 Instruction [1,0] started executing at cycle 2.
831 Instruction [1,0] reached the write back stage at cycle 4.
833 Instruction [1,0] was retired at cycle 10.
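The four events listed for instruction [1,0] can be read directly off its timeline row. The following is a minimal Python sketch of that reading, assuming the row is given as rendered text (plain "-" rather than the troff escape), that its first character corresponds to cycle 0, and that the markers are D (dispatched), e (executing), E (executed) and R (retired). The function name decode_timeline is hypothetical and not part of llvm-mca.

```python
def decode_timeline(row):
    """Map a timeline row such as '.DeeE-----R' to the cycle of each event."""

    def cycle_of(marker):
        # Column index equals cycle number under the cycle-0 assumption.
        position = row.find(marker)
        return position if position >= 0 else None

    write_back = cycle_of("E")
    execute_start = cycle_of("e")
    if execute_start is None:
        # No 'e' column: execution starts and completes in the 'E' cycle.
        execute_start = write_back
    return {
        "dispatched": cycle_of("D"),
        "execute_start": execute_start,
        "write_back": write_back,
        "retired": cycle_of("R"),
        # '-' marks cycles spent executed but not yet retired.
        "waiting_to_retire": row.count("-"),
    }


print(decode_timeline(".DeeE-----R"))
```

For the [1,0] row this yields cycles 1, 2, 4 and 10, with five cycles spent waiting to be retired, which agrees with the events listed above.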
843 There is a gap of 5 cycles between the write\-back stage and the retire event.
845 for [0,2] to be retired first (i.e., it has to wait until cycle 10).
853 In the dot\-product example, there are anti\-dependencies introduced by
861 instructions measured. Note that \fBllvm\-mca\fP, by default, assumes at
871 cycles spent in the queue tends to be larger (i.e., more than 1\-3cy),
875 The \fB\-bottleneck\-analysis\fP command line option enables the analysis of
882 Below is an example of \fB\-bottleneck\-analysis\fP output generated by
883 \fBllvm\-mca\fP for 500 iterations of the dot\-product example on btver2.
892   \- JFPA  [ 47.77% ]
893   \- JFPU0  [ 47.77% ]
895   \- Register Dependencies [ 0.30% ]
896   \- Memory Dependencies   [ 0.00% ]
901  +\-\-\-\-< 2.    vhaddps %xmm3, %xmm3, %xmm4
906 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
907  +\-\-\-\-> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
911 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
931 Bottleneck analysis is currently not supported for processors with an in\-order
935 The \fB\-all\-stats\fP command line option enables extra statistics and performance
939 Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP
940 for 300 iterations of the dot\-product example discussed in the previous
948 RAT     \- Register unavailable:                      0
949 RCU     \- Retire tokens unavailable:                 0
950 SCHEDQ  \- Scheduler full:                            272  (44.6%)
951 LQ      \- Load queue full:                           0
952 SQ      \- Store queue full:                          0
953 GROUP   \- Static restrictions on the dispatch group: 0
956 Dispatch Logic \- number of cycles where we saw N micro opcodes dispatched:
963 Schedulers \- number of cycles where we saw N micro opcodes issued:
981 Retire Control Unit \- number of cycles where we saw N instructions retired:
996 *  Register File #1 \-\- JFpuPRF:
1001 *  Register File #2 \-\- JIntegerPRF:
1018 \fB\-all\-stats\fP or \fB\-dispatch\-stats\fP\&.
1031 JALU01 \- A scheduler for ALU instructions.
1033 JFPU01 \- A scheduler for floating point operations.
1035 JLSAGU \- A scheduler for address generation.
1038 The dot\-product is a kernel of three floating point instructions (a vector
1043 sub\-optimal usage of hardware resources.  Sometimes, resource pressure can be
1047 statistics are displayed by using the command option \fB\-all\-stats\fP or
1048 \fB\-scheduler\-stats\fP\&.
1053 same cycle 399 times (65.4%) and there were 109 cycles where no instructions
1055 \fB\-all\-stats\fP or \fB\-retire\-stats\fP\&.
1059 Jaguar, there are two register files, one for floating\-point registers (JFpuPRF)
1061 instructions processed, there were 900 mappings created.  Since this dot\-product
1067 displayed by using the command option \fB\-all\-stats\fP or
1068 \fB\-register\-file\-stats\fP\&.
1075 \fBllvm\-mca\fP, as well as the functional units involved in the process.
1090 The in\-order pipeline implements the following sequence of stages:
1094 \fBllvm\-mca\fP assumes that instructions have all been decoded and placed
1097 diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction.
1121 the processor. \fBllvm\-mca\fP uses that information to initialize register
1124 \fB\-register\-file\-size\fP\&.  A value of zero for this option means \fIunbounded\fP\&. By
1129 number of micro\-opcodes specified for that instruction by the target scheduling
1131 instructions that are \(dqin\-flight\(dq, and retiring them in program order.  The
1136 entries. \fBllvm\-mca\fP queries the scheduling model to determine the set
1144 execution and may be issued (potentially out\-of\-order) for execution.
1145 Instruction latencies are computed by \fBllvm\-mca\fP with the help of the
1148 \fBllvm\-mca\fP\(aqs scheduler is designed to simulate multiple processor
1156 round\-robin selector to guarantee that resource usage is uniformly distributed
1159 \fBllvm\-mca\fP\(aqs scheduler internally groups instructions into three sets:
1172 Every cycle, the scheduler checks if instructions can be moved from the WaitSet
1176 .SS Write\-Back and Retire Stage
1179 instructions wait until they reach the write\-back stage.  At that point, they
1190 To simulate an out\-of\-order execution of memory operations, \fBllvm\-mca\fP
1195 specify flags \fB\-lqueue\fP and \fB\-squeue\fP to limit the number of entries in the
1214 (\fI\-noalias=true\fP) store operations.  Under this assumption, younger loads are
1219 Note that, in the case of write\-combining memory, rule 3 could be relaxed to
1220 allow reordering of non\-aliasing store operations.  That being said, at the
1221 moment, there is no way to further relax the memory model (\fB\-noalias\fP is the
1223 type (e.g., write\-back, write\-combining, write\-through, etc.) and consequently
1229 The LSUnit does not know when store\-to\-load forwarding may occur.
1239 loads, the scheduling model provides an \(dqoptimistic\(dq load\-to\-use latency (which
1240 usually matches the load\-to\-use latency for when there is a hit in the L1D).
1242 \fBllvm\-mca\fP does not (on its own) know about serializing operations or
1243 memory\-barrier like instructions.  The LSUnit used to conservatively use an
1245 determine whether an instruction should be treated as a memory\-barrier. This was
1265 A store may not pass a previous load (regardless of \fB\-noalias\fP).
1271 A load may not pass a previous store unless \fB\-noalias\fP is set.
1272 .IP 6. 3
1275 .SS In\-order Issue and Execute
1277 In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It
1280 met. Multiple instructions can be issued in one cycle according to the value of
1284 retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However,
1285 an instruction is allowed to commit writes and retire out\-of\-order if
1290 scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them
1296 hazards that \fBllvm\-mca\fP has no way of knowing about).
1298 \fBllvm\-mca\fP comes with one generic and multiple target specific
1299 CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP
1302 class is only a part of the in\-order pipeline, but there are plans to add it
1303 to the out\-of\-order pipeline in the future.
1335 LMUL to the scheduling class of the pseudo\-instruction that describes
1339 \fBllvm\-mca\fP comes with several Views such as the Timeline View and
1341 targets. If you wish to add a new View to \fBllvm\-mca\fP and it does not
1344 \fI/tools/llvm\-mca/View/\fP directory. However, if your new View is target specific
1358 the \fI\-disable\-cb\fP flag is used.
1360 Enabling these custom Views does not affect the non\-custom (generic) Views.
1366 2003-2023, LLVM Project