=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink

PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below to get the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
show the CPU id used to handle the PMU events. There is also an
"associated_cpus" sysfs attribute, which contains the list of CPUs associated
with the PMU instance.
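
For example, using the socket 0 UCF PMU path described in the next section,
these attributes can be read directly from sysfs::

    cat /sys/bus/event_source/devices/nvidia_ucf_pmu_0/cpumask
    cat /sys/bus/event_source/devices/nvidia_ucf_pmu_0/associated_cpus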

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache for CPU Memory and CXL Memory, and as a cache
coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

  * CPU clusters
  * GPU
  * PCIe Ordering Controller Unit (OCU)
  * Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

  * slc_access_rd: count the number of read requests to SLC.
  * slc_access_wr: count the number of write requests to SLC.
  * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
  * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
  * mem_access_rd: count the number of read requests to local or remote memory.
  * mem_access_wr: count the number of write requests to local or remote memory.
  * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
  * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
  * cycles: count the UCF cycles.

The average bandwidth is calculated as::

   AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
   AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
   AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
   AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
   AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
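
As a worked example, the sketch below samples the read-byte counters on socket
0 for 10 seconds and derives the average read bandwidth. It assumes the event
aliases listed above are exposed by the driver and that "perf stat -x," prints
the counter value in the first CSV field::

    perf stat -a -x, \
        -e nvidia_ucf_pmu_0/slc_bytes_rd/ \
        -e nvidia_ucf_pmu_0/mem_bytes_rd/ \
        -- sleep 10 2>&1 | awk -F, '
            /slc_bytes_rd/ { printf "SLC read bandwidth: %.2f GB/s\n", $1 / 10e9 }
            /mem_bytes_rd/ { printf "MEM read bandwidth: %.2f GB/s\n", $1 / 10e9 }'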

More details about what other events are available can be found in the Tegra410
SoC technical reference manual.

The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/

PCIE PMU
--------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filters. For example, specifying "src_rp_mask=0xFF" means the PMU
counters will capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth, utilization, and
latency:

  * rd_req: count the number of read requests by PCIE devices.
  * wr_req: count the number of write requests by PCIE devices.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * rd_cum_outs: count outstanding rd_req each cycle.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
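
As a worked example, the sketch below derives the average read latency for
socket 0, PCIE RC-0 from a 10-second sample. It assumes the event aliases
listed above are exposed by the driver and that "perf stat -x," prints the
counter value in the first CSV field::

    perf stat -a -x, \
        -e nvidia_pcie_pmu_0_rc_0/rd_req/ \
        -e nvidia_pcie_pmu_0_rc_0/rd_cum_outs/ \
        -e nvidia_pcie_pmu_0_rc_0/cycles/ \
        -- sleep 10 2>&1 | awk -F, '
            /rd_cum_outs/ { outs = $1; next }
            /rd_req/      { req = $1 }
            /cycles/      { cyc = $1 }
            END {
                freq_ghz = cyc / 10e9;      # FREQ_IN_GHZ
                lat_cyc  = outs / req;      # AVG_LATENCY_IN_CYCLES
                printf "Average read latency: %.1f ns\n", lat_cyc / freq_ghz
            }'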

The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
    bitmask represents the RP index in the RC. If the bit is set, all devices under
    the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
    devices under root ports 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula: (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709 (see the conversion sketch below).
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that the Root-Port and BDF filters are mutually exclusive, and the PMU in
  each RC supports only a single BDF filter shared by all of its counters. If the
  BDF filter is enabled, the BDF filter value will be applied to all events.

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.
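
The following is a minimal sketch (the helper name is hypothetical) that
converts an lspci-style bus:device.function string into the "src_bdf" encoding
described above::

    bdf_to_src_bdf() {
        local bus=$((16#${1%%:*}))            # bus number
        local devfn=${1#*:}
        local dev=$((16#${devfn%%.*}))        # device number
        local fn=$((16#${1##*.}))             # function number
        printf '0x%04x\n' $(( (bus << 8) + (dev << 3) + fn ))
    }

    bdf_to_src_bdf 27:01.1   # prints 0x2709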

Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/

.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to the lspci segment number can be non-trivial; hence a new
NVIDIA Designated Vendor Specific Capability (DVSEC) register is added to the
PCIE config space of each RP. This DVSEC has a vendor id of "10de" and a DVSEC
id of "0x4". The DVSEC register contains the following information to map PCIE
devices under the RP back to their RC#:

  - Bus# (byte 0xc): bus number as reported by the lspci output
  - Segment# (byte 0xd): segment number as reported by the lspci output
  - RP# (byte 0xe): port number as reported by the LnkCap attribute from lspci for a device with Root Port capability
  - RC# (byte 0xf): root complex number associated with the RP
  - Socket# (byte 0x10): socket number associated with the RP

Example script for mapping lspci BDF to RC# and socket#::

  #!/bin/bash
  # Walk all NVIDIA PCI functions and find the config-space offset of the
  # NVIDIA DVSEC (vendor 10de, id 0x0004) in each one.
  while read bdf rest; do
    dvsec4_reg=$(lspci -vv -s $bdf | awk '
      /Designated Vendor-Specific: Vendor=10de ID=0004/ {
        match($0, /\[([0-9a-fA-F]+)/, arr);
        print "0x" arr[1];
        exit
      }
    ')
    if [ -n "$dvsec4_reg" ]; then
      # Read the mapping bytes at fixed offsets within the DVSEC.
      bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
      segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
      rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
      rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
      socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
      echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
    fi
  done < <(lspci -d 10de:)

Example output::

  0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
  0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
  0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
  0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
  0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
  0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
  0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
  0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
  0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
  0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
  0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
  0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
  000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
  000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
  000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
  000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
  000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
  000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
  000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
  000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
  000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01

PCIE-TGT PMU
------------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in the Tegra410 SoC
can have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The
PMU provides an RP filter to count PCIE BAR traffic to each RP and an address
filter to count accesses to PCIE BAR or CXL HDM ranges. The details of the
filters are described in the following sections.

Mapping the RC# to the lspci segment number is the same as for the PCIE PMU.
Please see :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth and utilization:

  * rd_req: count the number of read requests to PCIE.
  * wr_req: count the number of write requests to PCIE.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
The address filter works for both PCIE BAR and CXL HDM ranges. These filters
can be found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
  corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
  only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable the BAR or CXL HDM address range filter. If this is set, the
  address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
  the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
  to determine if the traffic destination address falls within the filter range::

    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted.
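
  For example, a minimal sketch (the helper name is hypothetical) that applies
  this comparison to the filter used in the address-range example below
  (base=0x10000, mask=0xFFF00)::

    in_addr_filter() {
        local addr=$1 base=$2 mask=$3
        [ $(( addr & mask )) -eq $(( base & mask )) ] && echo match || echo no-match
    }

    in_addr_filter 0x100A0 0x10000 0xFFF00   # match    (counted)
    in_addr_filter 0x10100 0x10000 0xFFF00   # no-match (not counted)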

If the destination filter is not specified, the RP filter will be configured by default
to count PCIE BAR traffic to all root ports.

Example usage:

* Count event id 0x0 to root ports 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/

CPU Memory (CMEM) Latency PMU
-----------------------------

This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:

  * RD_REQ counters: count read requests (32B per request).
  * RD_CUM_OUTS counters: accumulate the number of outstanding read requests
    each cycle, i.e. track how many cycles the read requests are in flight.
  * CYCLES counter: counts the number of elapsed cycles.

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.

Example usage::

  perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'

NVLink-C2C PMU
--------------

This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
  * IN_RD_REQ: the number of incoming read requests.
  * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
  * IN_WR_REQ: the number of incoming write requests.
  * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
  * OUT_RD_REQ: the number of outgoing read requests.
  * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
  * OUT_WR_REQ: the number of outgoing write requests.
  * CYCLES: NVLink-C2C interface cycle counts.

The incoming events count the reads/writes from the remote device to the SoC.
The outgoing events count the reads/writes from the SoC to the remote device.

The sysfs attribute /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains information about the connected device.

When the C2C interface is connected to GPU(s), the "gpu_mask" parameter can be
used to filter traffic to/from specific GPU(s). Each bit represents a GPU index,
e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" to GPUs 0 and 1.
If "gpu_mask" is not specified, the PMU monitors all GPUs by default.

When connected to another SoC, only the read events are available.

The events can be used to calculate the average latency of the read/write requests::

   C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
   IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
   OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

Example usage:

  * Count incoming traffic from all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

  * Count incoming traffic from GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

  * Count incoming traffic from GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

  * Count outgoing traffic to all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

  * Count outgoing traffic to GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

  * Count outgoing traffic to GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/

NV-CLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-CLink interface. Bandwidth events are not available in this PMU.
In the Tegra410 SoC, the NV-CLink interface is used to connect to another
Tegra410 SoC, and this PMU only counts read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
  * IN_RD_REQ: the number of incoming read requests.
  * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
  * OUT_RD_REQ: the number of outgoing read requests.
  * CYCLES: NV-CLink interface cycle counts.

The incoming events count the reads from the remote device to the SoC.
The outgoing events count the reads from the SoC to the remote device.

The events can be used to calculate the average latency of the read requests::

   CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

Example usage:

  * Count incoming read traffic from the remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

  * Count outgoing read traffic to the remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/

NV-DLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-DLink interface. Bandwidth events are not available in this PMU.
In the Tegra410 SoC, this PMU only counts CXL memory read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
  * IN_RD_REQ: the number of read requests to CXL memory.
  * CYCLES: NV-DLink interface cycle counts.

The events can be used to calculate the average latency of the read requests::

   DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

  * Count read events to CXL memory::

      perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
523