llvm-mca.1 - OpenGrok cross reference for /freebsd/usr.bin/clang/llvm-mca/llvm-mca.1

4 .nr rst2man-indent-level 0
7 \\$1 \\n[an-margin]
8 level \\n[rst2man-indent-level]
9 level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
10 -
11 \\n[rst2man-indent0]
12 \\n[rst2man-indent1]
13 \\n[rst2man-indent2]
18 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
19 . nr rst2man-indent-level +1
24 .\" indent \\n[an-margin]
25 .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
26 .nr rst2man-indent-level -1
27 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
30 .TH "LLVM-MCA" "1" "2023-05-24" "16" "LLVM"
32 llvm-mca \- LLVM Machine Code Analyzer
35 \fBllvm\-mca\fP [\fIoptions\fP] [input]
38 \fBllvm\-mca\fP is a performance analysis tool that uses information
50 Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions
55 directly into \fBllvm\-mca\fP for analysis:
61 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-S \-o \- | llvm\-mca \-mcpu=btver2
73 $ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-mllvm \-x86\-asm\-syntax=intel \-S \-o \- | …
79 (\fBllvm\-mca\fP detects Intel syntax by the presence of an \fI\&.intel_syntax\fP
87 By design, the quality of the analysis conducted by \fBllvm\-mca\fP is
95 If \fBinput\fP is \(dq\fB\-\fP\(dq or omitted, \fBllvm\-mca\fP reads from standard
98 If the \fI\%\-o\fP option is omitted, then \fBllvm\-mca\fP will send its output
99 to standard output if the input is from standard input.  If the \fI\%\-o\fP
100 option specifies \(dq\fB\-\fP\(dq, then the output will also be sent to standard output.
103 .B \-help
108 .B \-o <filename>
114 .B \-mtriple=<target triple>
119 .B \-march=<arch>
125 .B \-mcpu=<cpuname>
131 .B \-output\-asm\-variant=<variant id>
139 .B \-print\-imm\-hex
145 .B \-dispatch=<width>
152 .B \-register\-file\-size=<size>
159 .B \-iterations=<number of iterations>
165 .B \-noalias=<bool>
171 .B \-lqueue=<load queue size>
179 .B \-squeue=<store queue size>
187 .B \-timeline
192 .B \-timeline\-max\-iterations=<iterations>
198 .B \-timeline\-max\-cycles=<cycles>
204 .B \-resource\-pressure
209 .B \-register\-file\-stats
214 .B \-dispatch\-stats
221 .B \-scheduler\-stats
227 .B \-retire\-stats
232 .B \-instruction\-info
237 .B \-show\-encoding
242 .B \-show\-barriers
248 .B \-all\-stats
255 .B \-all\-views
260 .B \-instruction\-tables
269 .B \-bottleneck\-analysis
273 processors with an in\-order backend.
277 .B \-json
286 .B \-disable\-cb
287 Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
294 .B \-disable\-im
295 Force usage of the generic InstrumentManager rather than using the target
302 \fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
306 \fBllvm\-mca\fP allows for the optional usage of special code comments to
308 substring \fBLLVM\-MCA\-BEGIN\fP marks the beginning of an analysis region. A
309 comment starting with substring \fBLLVM\-MCA\-END\fP marks the end of a region.
316 # LLVM\-MCA\-BEGIN
318 # LLVM\-MCA\-END
324 If no user\-defined region is specified, then \fBllvm\-mca\fP assumes a
335 # LLVM\-MCA\-BEGIN A simple example
337 # LLVM\-MCA\-END
345 in the \fBLLVM\-MCA\-END\fP directive. In the absence of overlapping regions,
346 an anonymous \fBLLVM\-MCA\-END\fP directive always ends the currently active user
355 # LLVM\-MCA\-BEGIN foo
357 # LLVM\-MCA\-BEGIN bar
359 # LLVM\-MCA\-END bar
360 # LLVM\-MCA\-END foo
372 # LLVM\-MCA\-BEGIN foo
374 # LLVM\-MCA\-BEGIN bar
376 # LLVM\-MCA\-END foo
378 # LLVM\-MCA\-END bar
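The region markers described above can be paired up with a small script. The following is a minimal sketch and is not how llvm-mca itself parses regions. It assumes `#` as the comment character, treats the text after the directive as the region name, and lets an anonymous `LLVM-MCA-END` close the most recently opened region. The names `MARKER` and `find_regions` are hypothetical.

```python
import re

# A comment containing LLVM-MCA-BEGIN or LLVM-MCA-END, followed by an
# optional region name. Using '#' as the comment character is an assumption.
MARKER = re.compile(r"#\s*LLVM-MCA-(BEGIN|END)\b(.*)")


def find_regions(lines):
    """Return (name, begin_line, end_line) for every closed region.

    Line numbers are 1-based and refer to the marker comments themselves.
    """
    active = []  # (name, begin_line) for regions that are still open
    closed = []
    for number, line in enumerate(lines, start=1):
        match = MARKER.search(line)
        if match is None:
            continue
        kind, name = match.group(1), match.group(2).strip()
        if kind == "BEGIN":
            active.append((name, number))
            continue
        if name:
            # A named END closes the most recent open region with that name,
            # which is what allows regions to overlap.
            index = next(
                (i for i in range(len(active) - 1, -1, -1) if active[i][0] == name),
                None,
            )
        else:
            # Assumption: an anonymous END closes the most recently opened region.
            index = len(active) - 1 if active else None
        if index is None:
            raise ValueError(f"line {number}: LLVM-MCA-END without a matching BEGIN")
        begin_name, begin_line = active.pop(index)
        closed.append((begin_name, begin_line, number))
    return closed
```

Applied to the overlapping foo/bar layout, the sketch reports foo as closing before bar even though bar was opened later. The instruction lines used to exercise it are placeholders.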
387 There is no support for marking regions from high\-level source code, like C or
395   __asm volatile(\(dq# LLVM\-MCA\-BEGIN foo\(dq:::\(dqmemory\(dq);
397   __asm volatile(\(dq# LLVM\-MCA\-END\(dq:::\(dqmemory\(dq);
417 special LLVM\-MCA comment directives.
423 # LLVM\-MCA\-<INSTRUMENT_TYPE> <data>
433 A comment starting with substring \fILLVM\-MCA\-<INSTRUMENT_TYPE>\fP
434 brings data into scope for llvm\-mca to use in its analysis for
445 Comments that are prefixed with \fILLVM\-MCA\-\fP but do not correspond to
448 that do not start with \fILLVM\-MCA\-\fP are ignored by \fBllvm\-mca\fP\&.
461 # LLVM\-MCA\-RISCV\-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
468 instructions to use the scheduling behaviour of its pseudo\-instruction
479 # LLVM\-MCA\-RISCV\-LMUL M2
492 vsetvli zero, a0, e8, m1, tu, mu
493 # LLVM\-MCA\-RISCV\-LMUL M1
506 vsetvli zero, a0, e8, m1, tu, mu
507 # LLVM\-MCA\-RISCV\-LMUL M1
510 # LLVM\-MCA\-RISCV\-LMUL M8
524 # LLVM\-MCA\-RISCV\-LMUL M1
527 # LLVM\-MCA\-RISCV\-LMUL M4
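To see which LMUL value is in scope for each instruction, the comments can be tracked with a short script. This is a rough sketch, not llvm-mca's implementation. It assumes that an `LLVM-MCA-RISCV-LMUL` comment applies to every following instruction until another comment of the same type replaces it, and that `#` starts a comment. The names `VALID_LMUL`, `LMUL_COMMENT` and `annotate_lmul` are hypothetical.

```python
import re

# The values accepted by the directive, as listed in its synopsis.
VALID_LMUL = {"M1", "M2", "M4", "M8", "MF2", "MF4", "MF8"}

# Assumption: '#' is the comment character for the input.
LMUL_COMMENT = re.compile(r"#\s*LLVM-MCA-RISCV-LMUL\s+(\S+)")


def annotate_lmul(lines):
    """Pair each instruction with the LMUL value in scope (None if unset)."""
    current = None
    annotated = []
    for line in lines:
        text = line.strip()
        match = LMUL_COMMENT.match(text)
        if match:
            value = match.group(1)
            if value not in VALID_LMUL:
                raise ValueError(f"unknown LMUL value: {value}")
            # Assumption: the new value replaces the previous one from here on.
            current = value
            continue
        if not text or text.startswith("#"):
            continue  # blank line or unrelated comment
        annotated.append((text, current))
    return annotated
```

In this sketch an instruction that appears before the first LMUL comment is paired with `None`, which simply means that no instrument data was in scope for it.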
533 .SH HOW LLVM-MCA WORKS
535 \fBllvm\-mca\fP takes assembly code as input. The assembly code is parsed
546 dot\-product of two packed float vectors of four elements. The analysis is
549 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP:
555 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=300 dot\-product.s
591 [0]   \- JALU0
592 [1]   \- JALU1
593 [2]   \- JDiv
594 [3]   \- JFPA
595 [4]   \- JFPM
596 [5]   \- JFPU0
597 [6]   \- JFPU1
598 [7]   \- JLAGU
599 [8]   \- JMul
600 [9]   \- JSAGU
601 [10]  \- JSTC
602 [11]  \- JVALU0
603 [12]  \- JVALU1
604 [13]  \- JVIMUL
609 …\-      \-      \-     2.00   1.00   2.00   1.00    \-      \-      \-      \-      \-      \-    …
613 …\-      \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-  …
614 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
615 …\-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-  …
621 According to this report, the dot\-product kernel has been executed 300 times,
632 to the out\-of\-order backend every simulated cycle. For processors with an
633 in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued
645 In the absence of loop\-carried data dependencies, the observed IPC tends to a
651 field is an indicator of a performance issue. In the absence of loop\-carried
668 are no loop\-carried dependencies, the observed \fIuOps Per Cycle\fP is expected to
688 \fI\-show\-encoding\fP is specified.
690 Below is an example of \fI\-show\-encoding\fP output for the dot\-product kernel:
725 (JFPU1 \- floating point pipeline #1), consuming an average of 1 resource cycle
726 per iteration.  Note that on AMD Jaguar, vector floating\-point multiply can
727 only be issued to pipeline JFPU1, while horizontal floating\-point additions can
738 command line option \fB\-timeline\fP\&.  As instructions transition through the
753 \- : Instruction executed, waiting to be retired.
756 Below is the timeline view for a subset of the dot\-product example located in
757 \fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP and processed by
758 \fBllvm\-mca\fP using the following command:
764 $ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=3 \-timeline dot\-product…
781 [1,0]     .DeeE\-\-\-\-\-R    .   vmulps   %xmm0, %xmm1, %xmm2
782 [1,1]     . D=eeeE\-\-\-R   .   vhaddps  %xmm2, %xmm2, %xmm3
784 [2,0]     .  DeeE\-\-\-\-\-R  .   vmulps   %xmm0, %xmm1, %xmm2
813 sub\-optimal usage of hardware resources.
818 example was generated using 3 iterations: \fB\-iterations=3\fP, the iteration
819 indices range from 0 to 2, inclusive.
843 There is a gap of 5 cycles between the write\-back stage and the retire event.
853 In the dot\-product example, there are anti\-dependencies introduced by
861 instructions measured. Note that \fBllvm\-mca\fP, by default, assumes at
871 cycles spent in the queue tends to be larger (i.e., more than 1\-3cy),
875 The \fB\-bottleneck\-analysis\fP command line option enables the analysis of
882 Below is an example of \fB\-bottleneck\-analysis\fP output generated by
883 \fBllvm\-mca\fP for 500 iterations of the dot\-product example on btver2.
892   \- JFPA  [ 47.77% ]
893   \- JFPU0  [ 47.77% ]
895   \- Register Dependencies [ 0.30% ]
896   \- Memory Dependencies   [ 0.00% ]
901  +\-\-\-\-< 2.    vhaddps %xmm3, %xmm3, %xmm4
906 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
907  +\-\-\-\-> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
911 …+\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability…
931 Bottleneck analysis is currently not supported for processors with an in\-order
935 The \fB\-all\-stats\fP command line option enables extra statistics and performance
939 Below is an example of \fB\-all\-stats\fP output generated by  \fBllvm\-mca\fP
940 for 300 iterations of the dot\-product example discussed in the previous
948 RAT     \- Register unavailable:                      0
949 RCU     \- Retire tokens unavailable:                 0
950 SCHEDQ  \- Scheduler full:                            272  (44.6%)
951 LQ      \- Load queue full:                           0
952 SQ      \- Store queue full:                          0
953 GROUP   \- Static restrictions on the dispatch group: 0
956 Dispatch Logic \- number of cycles where we saw N micro opcodes dispatched:
963 Schedulers \- number of cycles where we saw N micro opcodes issued:
981 Retire Control Unit \- number of cycles where we saw N instructions retired:
996 *  Register File #1 \-\- JFpuPRF:
1001 *  Register File #2 \-\- JIntegerPRF:
1018 \fB\-all\-stats\fP or \fB\-dispatch\-stats\fP\&.
1031 JALU01 \- A scheduler for ALU instructions.
1033 JFPU01 \- A scheduler for floating point operations.
1035 JLSAGU \- A scheduler for address generation.
1038 The dot\-product is a kernel of three floating point instructions (a vector
1043 sub\-optimal usage of hardware resources.  Sometimes, resource pressure can be
1047 statistics are displayed by using the command option \fB\-all\-stats\fP or
1048 \fB\-scheduler\-stats\fP\&.
1055 \fB\-all\-stats\fP or \fB\-retire\-stats\fP\&.
1059 Jaguar, there are two register files, one for floating\-point registers (JFpuPRF)
1061 instructions processed, there were 900 mappings created.  Since this dot\-product
1067 displayed by using the command option \fB\-all\-stats\fP or
1068 \fB\-register\-file\-stats\fP\&.
1075 \fBllvm\-mca\fP, as well as the functional units involved in the process.
1090 The in\-order pipeline implements the following sequence of stages:
1094 \fBllvm\-mca\fP assumes that instructions have all been decoded and placed
1097 diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction.
1121 the processor. \fBllvm\-mca\fP uses that information to initialize register
1124 \fB\-register\-file\-size\fP\&.  A value of zero for this option means \fIunbounded\fP\&. By
1129 number of micro\-opcodes specified for that instruction by the target scheduling
1131 instructions that are \(dqin\-flight\(dq, and retiring them in program order.  The
1136 entries. \fBllvm\-mca\fP queries the scheduling model to determine the set
1144 execution and may be issued (potentially out\-of\-order) for execution.
1145 Instruction latencies are computed by \fBllvm\-mca\fP with the help of the
1148 \fBllvm\-mca\fP\(aqs scheduler is designed to simulate multiple processor
1156 round\-robin selector to guarantee that resource usage is uniformly distributed
1159 \fBllvm\-mca\fP\(aqs scheduler internally groups instructions into three sets:
1176 .SS Write\-Back and Retire Stage
1179 instructions wait until they reach the write\-back stage.  At that point, they
1190 To simulate an out\-of\-order execution of memory operations, \fBllvm\-mca\fP
1195 specify flags \fB\-lqueue\fP and \fB\-squeue\fP to limit the number of entries in the
1214 (\fI\-noalias=true\fP) store operations.  Under this assumption, younger loads are
1219 Note that, in the case of write\-combining memory, rule 3 could be relaxed to
1220 allow reordering of non\-aliasing store operations.  That being said, at the
1221 moment, there is no way to further relax the memory model (\fB\-noalias\fP is the
1223 type (e.g., write\-back, write\-combining, write\-through, etc.) and consequently
1229 The LSUnit does not know when store\-to\-load forwarding may occur.
1239 loads, the scheduling model provides an \(dqoptimistic\(dq load\-to\-use latency (which
1240 usually matches the load\-to\-use latency for when there is a hit in the L1D).
1242 \fBllvm\-mca\fP does not (on its own) know about serializing operations or
1243 memory\-barrier like instructions.  The LSUnit used to conservatively use an
1245 determine whether an instruction should be treated as a memory\-barrier. This was
1265 A store may not pass a previous load (regardless of \fB\-noalias\fP).
1271 A load may not pass a previous store unless \fB\-noalias\fP is set.
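The two rules quoted above that depend on the pairing of a load with a store can be written as a small predicate. This is only a sketch of those two rules, not of the complete LSUnit model; barriers and same-kind pairs are deliberately left out. The function name `may_pass` and its parameters are hypothetical.

```python
def may_pass(younger, older, noalias):
    """Return True if the younger memory operation may pass the older one.

    `younger` and `older` are "load" or "store"; `noalias` mirrors the
    value of the -noalias flag. Only the load/store and store/load pairs
    are covered by this sketch.
    """
    if younger == "store" and older == "load":
        # A store may not pass a previous load, regardless of -noalias.
        return False
    if younger == "load" and older == "store":
        # A load may pass a previous store only when -noalias is set.
        return bool(noalias)
    raise NotImplementedError("pair not covered by this sketch")
```

For example, `may_pass("load", "store", noalias=True)` evaluates to `True`, while `may_pass("store", "load", noalias=True)` evaluates to `False`.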
1275 .SS In\-order Issue and Execute
1277 In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It
1284 retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However,
1285 an instruction is allowed to commit writes and retire out\-of\-order if
1290 scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them
1296 hazards that \fBllvm\-mca\fP has no way of knowing about).
1298 \fBllvm\-mca\fP comes with one generic and multiple target specific
1299 CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP
1302 class is only a part of the in\-order pipeline, but there are plans to add it
1303 to the out\-of\-order pipeline in the future.
1335 LMUL to the scheduling class of the pseudo\-instruction that describes
1339 \fBllvm\-mca\fP comes with several Views such as the Timeline View and
1341 targets. If you wish to add a new View to \fBllvm\-mca\fP and it does not
1344 \fI/tools/llvm\-mca/View/\fP directory. However, if your new View is target specific
1358 the \fI\-disable\-cb\fP flag is used.
1360 Enabling these custom Views does not affect the non\-custom (generic) Views.
1366 2003-2023, LLVM Project