=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink

PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below to get the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show
the CPU id used to handle the PMU event. There is also an "associated_cpus"
sysfs attribute, which contains a list of CPUs associated with the PMU instance.

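As a minimal sketch, the two attributes can be read for every PMU instance with
a small shell function. The function name ``list_nvidia_pmus`` and its optional
directory argument are illustrative and not part of the driver, and the
``nvidia_*_pmu_*`` glob is an assumption based on the device names used in this
document::

  #!/bin/bash
  # Print each NVIDIA uncore PMU with its cpumask and associated_cpus.
  # $1 optionally overrides the sysfs directory (useful for testing).
  list_nvidia_pmus() {
    local root=${1:-/sys/bus/event_source/devices}
    local dev
    for dev in "$root"/nvidia_*_pmu_*; do
      [ -d "$dev" ] || continue   # the glob matched nothing
      printf '%s cpumask=%s associated_cpus=%s\n' \
        "${dev##*/}" \
        "$(cat "$dev/cpumask" 2>/dev/null)" \
        "$(cat "$dev/associated_cpus" 2>/dev/null)"
    done
  }

  list_nvidia_pmus

On a system without these PMUs the function prints nothing.
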
UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache for CPU Memory and CXL Memory, and as a
cache-coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

  * CPU clusters
  * GPU
  * PCIe Ordering Controller Unit (OCU)
  * Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

  * slc_access_rd: count the number of read requests to SLC.
  * slc_access_wr: count the number of write requests to SLC.
  * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
  * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
  * mem_access_rd: count the number of read requests to local or remote memory.
  * mem_access_wr: count the number of write requests to local or remote memory.
  * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
  * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
  * cycles: count the UCF cycles.

The average bandwidth is calculated as::

   AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
   AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
   AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
   AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
   AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES

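The bandwidth formulas need no unit conversion because one byte per nanosecond
equals 10^9 bytes per second, i.e. 1 GB/s. The sketch below applies the
formulas to counter values passed on the command line. The function names and
the sample numbers are hypothetical and are not measured data::

  #!/bin/bash
  # usage: avg_bandwidth_gbps BYTES ELAPSED_TIME_IN_NS
  avg_bandwidth_gbps() {
    awk -v bytes="$1" -v ns="$2" 'BEGIN { printf "%.3f\n", bytes / ns }'
  }

  # usage: avg_request_rate REQUESTS CYCLES  (requests per UCF cycle)
  avg_request_rate() {
    awk -v req="$1" -v cyc="$2" 'BEGIN { printf "%.4f\n", req / cyc }'
  }

  # Made-up counts: 64e9 bytes read from SLC during a 2 second window.
  avg_bandwidth_gbps 64000000000 2000000000   # prints 32.000
  # Made-up counts: 5e8 read requests during 4e9 UCF cycles.
  avg_request_rate 500000000 4000000000       # prints 0.1250

The byte and cycle counts would come from a ``perf stat`` run that collects the
events listed above, and the elapsed time from the same run. A zero elapsed
time or cycle count is not handled by this sketch.
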
More details about the other available events can be found in the Tegra410 SoC
technical reference manual.

The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/

PCIE PMU
--------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter
will capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth, utilization, and
latency:

  * rd_req: count the number of read requests by PCIE device.
  * wr_req: count the number of write requests by PCIE device.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * rd_cum_outs: count outstanding rd_req each cycle.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

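The latency formula works because rd_cum_outs adds the number of outstanding
reads on every cycle, so each read contributes one count per cycle it is in
flight. Dividing by rd_req gives the average number of cycles per read. A
minimal sketch of the three steps follows. The function name and the sample
values are hypothetical::

  #!/bin/bash
  # usage: avg_read_latency_ns RD_CUM_OUTS RD_REQ CYCLES ELAPSED_TIME_IN_NS
  avg_read_latency_ns() {
    awk -v outs="$1" -v req="$2" -v cyc="$3" -v ns="$4" 'BEGIN {
      # Avoid dividing by zero when nothing was counted.
      if (req == 0 || cyc == 0 || ns == 0) { print "n/a"; exit }
      freq_ghz = cyc / ns          # FREQ_IN_GHZ
      lat_cycles = outs / req      # AVG_LATENCY_IN_CYCLES
      printf "%.1f\n", lat_cycles / freq_ghz
    }'
  }

  # Made-up counts: 1 s window, 2e9 cycles, 1e6 reads, 3e8 accumulated.
  avg_read_latency_ns 300000000 1000000 2000000000 1000000000   # prints 150.0

With these sample values the clock is 2 GHz and each read is outstanding for
300 cycles on average, which is 150 ns.
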
The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
    bitmask represents the RP index in the RC. If the bit is set, all devices under
    the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
    devices in root port 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula: (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709.
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that Root-Port and BDF filters are mutually exclusive and the PMU in
  each RC can only have one BDF filter for all counters. If the BDF filter
  is enabled, the BDF filter value will be applied to all events.

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.

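The two source filter values can be derived mechanically. The sketch below
builds a src_rp_mask value from a list of root port indexes and applies the
src_bdf formula literally to a BDF written as BB:DD.F in hexadecimal, without
the segment prefix. The function names are illustrative only::

  #!/bin/bash
  # usage: rp_mask RP_INDEX...  -> bitmask with one bit set per root port
  rp_mask() {
    local mask=0 rp
    for rp in "$@"; do
      mask=$(( mask | (1 << rp) ))
    done
    printf '0x%x\n' "$mask"
  }

  # usage: bdf_to_filter BB:DD.F  (hex fields, as printed by lspci)
  bdf_to_filter() {
    local bus=${1%%:*} rest=${1#*:}
    local dev=${rest%%.*} fn=${rest#*.}
    # (bus << 8) + (device << 3) + (function)
    printf '0x%04x\n' $(( (16#$bus << 8) + (16#$dev << 3) + 16#$fn ))
  }

  rp_mask 0 1 2 3          # prints 0xf
  bdf_to_filter 27:01.1    # prints 0x2709

For 27:01.1 the formula evaluates to (0x27 << 8) + (0x01 << 3) + 0x1 = 0x2709.
Because the Root-Port and BDF filters are mutually exclusive, only one of the
two values would be passed in a given event.
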
Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/

.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
Designated Vendor Specific Capability (DVSEC) register is added into the PCIE
config space for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4".
The DVSEC register contains the following information to map PCIE devices under
the RP back to its RC#:

  - Bus# (byte 0xc): bus number as reported by the lspci output
  - Segment# (byte 0xd): segment number as reported by the lspci output
  - RP# (byte 0xe): port number as reported by LnkCap attribute from lspci for a
    device with Root Port capability
  - RC# (byte 0xf): root complex number associated with the RP
  - Socket# (byte 0x10): socket number associated with the RP

Example script for mapping lspci BDF to RC# and socket#::

  #!/bin/bash
  # Requires gawk: match() with an array argument is a gawk extension.
  while read -r bdf rest; do
    # Find the config space offset of the NVIDIA DVSEC with ID 0x4.
    dvsec4_reg=$(lspci -vv -s "$bdf" | awk '
      /Designated Vendor-Specific: Vendor=10de ID=0004/ {
        match($0, /\[([0-9a-fA-F]+)/, arr);
        print "0x" arr[1];
        exit
      }
    ')
    if [ -n "$dvsec4_reg" ]; then
      # Read the mapping bytes at offsets 0xc to 0x10 of the capability.
      bus=$(setpci -s "$bdf" $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
      segment=$(setpci -s "$bdf" $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
      rp=$(setpci -s "$bdf" $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
      rc=$(setpci -s "$bdf" $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
      socket=$(setpci -s "$bdf" $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
      echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
    fi
  done < <(lspci -d 10de:)

Example output::

  0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
  0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
  0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
  0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
  0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
  0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
  0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
  0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
  0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
  0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
  0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
  0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
  000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
  000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
  000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
  000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
  000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
  000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
  000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
  000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
  000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01

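Each line of that output identifies the socket and RC of a root port, which is
enough to name the PCIE PMU device that observes it. The sketch below turns
mapping lines read from standard input into device names. The function name is
hypothetical. It assumes that the bytes are printed in hexadecimal and that the
ids in the device name are decimal, which this document does not state::

  #!/bin/bash
  # Reads lines such as
  #   0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
  # and prints the BDF followed by the matching PCIE PMU device name.
  pmu_for_mapping() {
    local line rc socket
    while read -r line; do
      [[ $line =~ RC=([0-9a-fA-F]+),\ Socket=([0-9a-fA-F]+) ]] || continue
      rc=$(( 16#${BASH_REMATCH[1]} ))       # assumed hex byte
      socket=$(( 16#${BASH_REMATCH[2]} ))   # assumed hex byte
      printf '%s nvidia_pcie_pmu_%d_rc_%d\n' "${line%%: *}" "$socket" "$rc"
    done
  }

  echo "0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00" |
    pmu_for_mapping

For the sample line the sketch names nvidia_pcie_pmu_0_rc_1. Lines that do not
contain the RC and Socket fields are skipped.
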
PCIE-TGT PMU
------------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in the Tegra410 SoC can
have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The PMU
provides an RP filter to count PCIE BAR traffic to each RP and an address filter
to count accesses to PCIE BAR or CXL HDM ranges. The details of the filters are
described in the following sections.

Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see
:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth and utilization:

  * rd_req: count the number of read requests to PCIE.
  * wr_req: count the number of write requests to PCIE.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
The address filter works for both PCIE BAR and CXL HDM ranges. These filters can
be found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
  corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
  only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
  address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
  the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
  to determine if the traffic destination address falls within the filter range::

    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted.

If the destination filter is not specified, the RP filter will be configured by default
to count PCIE BAR traffic to all root ports.

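The address comparison can be tried on candidate addresses before programming
the filter. The sketch below reproduces the comparison in shell arithmetic. The
function name is illustrative, and the width of the hardware mask is not
modeled::

  #!/bin/bash
  # usage: addr_matches ADDR BASE MASK
  # Exit status 0 means the comparison succeeds and the event would be counted.
  addr_matches() {
    local addr=$(( $1 )) base=$(( $2 )) mask=$(( $3 ))
    (( (addr & mask) == (base & mask) ))
  }

  addr_matches 0x100FF 0x10000 0xFFF00 && echo "0x100FF is counted"
  addr_matches 0x10100 0x10000 0xFFF00 || echo "0x10100 is not counted"
  # Bits that are clear in the mask are ignored by the comparison.
  addr_matches 0x110000 0x10000 0xFFF00 && echo "0x110000 is counted"

The last call shows a consequence of the comparison as written: an address that
differs from the base only in bits outside the mask still matches, so the mask
has to cover every upper address bit that should be compared.
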
Example usage:

* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/

CPU Memory (CMEM) Latency PMU
-----------------------------

This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:

  * RD_REQ counters: count read requests (32B per request).
  * RD_CUM_OUTS counters: accumulated outstanding request counters, which track
    how many cycles the read requests are in flight.
  * CYCLES counter: counts the number of elapsed cycles.

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.

Example usage::

  perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'

NVLink-C2C PMU
--------------

This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming read requests.
  * IN_RD_REQ: the number of incoming read requests.
  * IN_WR_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming write requests.
  * IN_WR_REQ: the number of incoming write requests.
  * OUT_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing read requests.
  * OUT_RD_REQ: the number of outgoing read requests.
  * OUT_WR_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing write requests.
  * OUT_WR_REQ: the number of outgoing write requests.
  * CYCLES: NVLink-C2C interface cycle counts.

The incoming events count the reads/writes from the remote device to the SoC.
The outgoing events count the reads/writes from the SoC to the remote device.

The sysfs file /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains information about the connected device.

When the C2C interface is connected to GPU(s), the user can use the
"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents
the GPU index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for
GPU 0 and 1. If "gpu_mask" is not specified, the PMU monitors all GPUs.

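A latency measurement needs the request count, the accumulated outstanding
count, and the cycle count from the same run. As a sketch, the helper below
assembles such an event group for one direction and operation, adding
"gpu_mask" only when GPU indexes are given. The function name is hypothetical,
and leaving the filter off the cycles event is an assumption::

  #!/bin/bash
  # usage: c2c_latency_group SOCKET PREFIX [GPU_INDEX...]
  # PREFIX is one of in_rd, in_wr, out_rd, out_wr.
  c2c_latency_group() {
    local pmu="nvidia_nvlink_c2c_pmu_$1" prefix=$2 mask=0 filter="" gpu
    shift 2
    for gpu in "$@"; do
      mask=$(( mask | (1 << gpu) ))
    done
    # Omit gpu_mask when no GPU is listed so that the default applies.
    if [ "$mask" -ne 0 ]; then
      filter=$(printf ',gpu_mask=0x%x' "$mask")
    fi
    printf '{%s/%s_req%s/,%s/%s_cum_outs%s/,%s/cycles/}\n' \
      "$pmu" "$prefix" "$filter" "$pmu" "$prefix" "$filter" "$pmu"
  }

  c2c_latency_group 0 in_rd 0 1

The printed string can be passed to ``perf stat -a -e``. For GPU 0 and 1 the
filter part is ``gpu_mask=0x3``.
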
When connected to another SoC, only the read events are available.

The events can be used to calculate the average latency of the read/write requests::

   C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
   IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
   OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

Example usage:

  * Count incoming traffic from all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

  * Count incoming traffic from GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

  * Count incoming traffic from GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

  * Count outgoing traffic to all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

  * Count outgoing traffic to GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

  * Count outgoing traffic to GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/

453*2f89b7f7SBesar WicaksonoNV-CLink PMU
454*2f89b7f7SBesar Wicaksono------------
455*2f89b7f7SBesar Wicaksono
456*2f89b7f7SBesar WicaksonoThis PMU monitors latency events of memory read requests that pass through
457*2f89b7f7SBesar Wicaksonothe NV-CLINK interface. Bandwidth events are not available in this PMU.
458*2f89b7f7SBesar WicaksonoIn Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410
459*2f89b7f7SBesar WicaksonoSoC and this PMU only counts read traffic.
460*2f89b7f7SBesar Wicaksono
461*2f89b7f7SBesar WicaksonoThe events and configuration options of this PMU device are available in sysfs,
462*2f89b7f7SBesar Wicaksonosee /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
463*2f89b7f7SBesar Wicaksono
464*2f89b7f7SBesar WicaksonoThe list of events:
465*2f89b7f7SBesar Wicaksono
466*2f89b7f7SBesar Wicaksono  * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
467*2f89b7f7SBesar Wicaksono  * IN_RD_REQ: the number of incoming read requests.
468*2f89b7f7SBesar Wicaksono  * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
469*2f89b7f7SBesar Wicaksono  * OUT_RD_REQ: the number of outgoing read requests.
470*2f89b7f7SBesar Wicaksono  * CYCLES: NV-CLINK interface cycle counts.
471*2f89b7f7SBesar Wicaksono
472*2f89b7f7SBesar WicaksonoThe incoming events count the reads from remote device to the SoC.
473*2f89b7f7SBesar WicaksonoThe outgoing events count the reads from the SoC to remote device.

The events can be used to calculate the average latency of the read requests::

   CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
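
As an illustration, the formulas above can be applied to raw counter values
with a few lines of Python. This is only a sketch: the function name
``avg_read_latency`` is hypothetical and the counter values are made up. The
same function works for the incoming or the outgoing direction, depending on
which ``*_RD_CUM_OUTS`` and ``*_RD_REQ`` pair is passed in.

```python
def avg_read_latency(cum_outs, req, cycles, elapsed_ns):
    """Return (latency_in_cycles, latency_in_ns), or None if undefined."""
    if req == 0 or cycles == 0 or elapsed_ns == 0:
        # No requests to average over, or the frequency cannot be derived.
        return None
    freq_ghz = cycles / elapsed_ns          # CLINK_FREQ_IN_GHZ
    latency_cycles = cum_outs / req         # *_RD_AVG_LATENCY_IN_CYCLES
    latency_ns = latency_cycles / freq_ghz  # *_RD_AVG_LATENCY_IN_NS
    return latency_cycles, latency_ns


# Made-up counter values, for illustration only.
result = avg_read_latency(
    cum_outs=1_200_000,        # IN_RD_CUM_OUTS
    req=4_000,                 # IN_RD_REQ
    cycles=2_000_000_000,      # CYCLES
    elapsed_ns=1_000_000_000,  # one second of measurement
)
print(result)  # (300.0, 150.0)
```

With these values the interface runs at 2 GHz, so an average of 300 cycles per
read corresponds to 150 ns.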

Example usage:

  * Count incoming read requests from the remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

  * Count outgoing read requests to the remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/

NV-DLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-DLink interface. Bandwidth events are not available in this PMU.
In the Tegra410 SoC, this PMU only counts CXL memory read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
  * IN_RD_REQ: the number of read requests to CXL memory.
  * CYCLES: NV-DLink interface cycle counts.

The events can be used to calculate the average latency of the read requests::

   DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

  * Count read events to CXL memory::

      perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
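
The two counters from the grouped command above can be turned into an average
latency in cycles by post-processing machine-readable ``perf stat`` output. The
sketch below assumes the command was run with ``-x,`` and without ``-I``,
``-A`` or ``--per-*`` options, so that each line starts with the counter value,
the unit and the event name, as described in the CSV FORMAT section of
perf-stat(1). The function name ``parse_perf_csv`` is hypothetical and the
sample lines are made up for illustration; they are not measured data.

```python
def parse_perf_csv(text, sep=","):
    """Map event name -> counter value from `perf stat -x<sep>` output.

    Only valid for event names that do not contain `sep`; an event such as
    .../in_rd_cum_outs,gpu_mask=0x1/ needs a different separator, e.g. ';'.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        try:
            counts[event] = float(value)
        except ValueError:
            # Skip entries such as "<not counted>" or "<not supported>".
            continue
    return counts


# Made-up sample in the assumed field order: value, unit, event, run time, ...
sample = """\
5000,,nvidia_nvdlink_pmu_0/in_rd_req/,1000000000,100.00,,
1500000,,nvidia_nvdlink_pmu_0/in_rd_cum_outs/,1000000000,100.00,,
"""

counts = parse_perf_csv(sample)
latency_cycles = (counts["nvidia_nvdlink_pmu_0/in_rd_cum_outs/"]
                  / counts["nvidia_nvdlink_pmu_0/in_rd_req/"])
print(latency_cycles)  # 300.0
```

To convert the result to nanoseconds, the CYCLES event would also have to be
collected and combined with the elapsed time, following the
``DLINK_FREQ_IN_GHZ`` formula above.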