=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink

PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below to get the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to
show the CPU id used to handle the PMU events. There is also an
"associated_cpus" sysfs attribute, which contains a list of CPUs associated
with the PMU instance.

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache (SLC) for CPU memory and CXL memory, and as a
cache-coherent interconnect that provides hardware coherence across multiple
coherently caching agents, including:

 * CPU clusters
 * GPU
 * PCIe Ordering Controller Unit (OCU)
 * Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

 * slc_access_rd: count the number of read requests to SLC.
 * slc_access_wr: count the number of write requests to SLC.
 * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
 * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
 * mem_access_rd: count the number of read requests to local or remote memory.
 * mem_access_wr: count the number of write requests to local or remote memory.
 * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
 * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
 * cycles: count the UCF cycles.

The average bandwidth is calculated as::

  AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
  AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
  AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
  AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
  AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
  AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES

More details about the other available events can be found in the Tegra410 SoC
Technical Reference Manual.

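
As an illustration of the bandwidth formula, the snippet below measures the
average SLC read bandwidth on socket 0 over a fixed window. It is a minimal
sketch: the nvidia_ucf_pmu_0 instance and the slc_bytes_rd event come from this
section, while the 10-second window, the use of perf's CSV output (-x,), and
approximating the elapsed time by the sleep duration are arbitrary choices::

  # Measure slc_bytes_rd system-wide for 10 seconds and derive GB/s.
  secs=10
  bytes=$(perf stat -a -x, -e nvidia_ucf_pmu_0/slc_bytes_rd/ -- sleep "$secs" 2>&1 |
          awk -F, '/slc_bytes_rd/ { print $1 }')
  # bytes / elapsed-ns is GB/s, matching AVG_SLC_READ_BANDWIDTH_IN_GBPS above.
  awk -v b="$bytes" -v s="$secs" \
      'BEGIN { printf "AVG_SLC_READ_BANDWIDTH_IN_GBPS = %.2f\n", b / (s * 1e9) }'
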
The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, and PCIE devices of the remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of the remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/

PCIE PMU
--------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filters. For example, specifying "src_rp_mask=0xFF" means the PMU counter
will capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth, utilization, and
latency:

 * rd_req: count the number of read requests made by PCIE devices.
 * wr_req: count the number of write requests made by PCIE devices.
 * rd_bytes: count the number of bytes transferred by rd_req.
 * wr_bytes: count the number of bytes transferred by wr_req.
 * rd_cum_outs: count outstanding rd_req each cycle.
 * cycles: count the clock cycles of the SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

  AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
  AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
  AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

  FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
  AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
  AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

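
As a worked example of the latency recipe, the sketch below derives the average
read latency of RC-0 on socket 0 over a 5-second window. The instance name and
events come from this section; grouping the events, the window length, the CSV
post-processing, and approximating the elapsed time by the sleep duration are
arbitrary choices::

  # Collect rd_req, rd_cum_outs, and cycles together, then apply the formula.
  secs=5
  perf stat -a -x, \
      -e "{nvidia_pcie_pmu_0_rc_0/rd_req/,nvidia_pcie_pmu_0_rc_0/rd_cum_outs/,nvidia_pcie_pmu_0_rc_0/cycles/}" \
      -- sleep "$secs" 2>&1 |
  awk -F, -v ns="$((secs * 1000000000))" '
      /rd_req/      { req  = $1 }
      /rd_cum_outs/ { outs = $1 }
      /cycles/      { cyc  = $1 }
      END {
          freq_ghz = cyc / ns       # FREQ_IN_GHZ
          lat_cyc  = outs / req     # AVG_LATENCY_IN_CYCLES
          printf "AVERAGE_LATENCY_IN_NS = %.1f\n", lat_cyc / freq_ghz
      }'
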
The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
    bitmask represents the RP index in the RC. If the bit is set, all devices
    under the associated RP will be monitored. E.g. "src_rp_mask=0xF" will
    monitor devices in root ports 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that follows
    the formula: (bus << 8) + (device << 3) + (function). For example, the value
    of BDF 27:01.1 is 0x2709 (see the sketch after this list).
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that the root-port and BDF filters are mutually exclusive, and the PMU in
  each RC supports only one BDF filter value shared by all of its counters. If
  the BDF filter is enabled, the BDF filter value is applied to all events.

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.

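
The src_bdf encoding can be derived with shell arithmetic. A minimal sketch,
using BDF 27:01.1 purely as an example value::

  # Encode bus 0x27, device 0x01, function 0x1 per the formula above.
  bus=0x27; dev=0x01; fn=0x1
  printf 'src_bdf=0x%04x\n' $(( (bus << 8) | (dev << 3) | fn ))
  # Prints: src_bdf=0x2709
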
Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root ports 0 and 1 of PCIE RC-1 on socket 0,
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root ports 0 and 1 of PCIE RC-3 on socket 1,
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/

.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to the lspci segment number can be non-trivial; hence, a new
NVIDIA Designated Vendor-Specific Extended Capability (DVSEC) register is added
to the PCIE config space of each RP. This DVSEC has vendor id "10de" and DVSEC
id "0x4". The DVSEC register contains the following information to map PCIE
devices under the RP back to their RC#:

 - Bus# (byte 0xc): bus number as reported by the lspci output
 - Segment# (byte 0xd): segment number as reported by the lspci output
 - RP# (byte 0xe): port number as reported by the LnkCap attribute from lspci
   for a device with Root Port capability
 - RC# (byte 0xf): root complex number associated with the RP
 - Socket# (byte 0x10): socket number associated with the RP

Example script for mapping lspci BDF to RC# and socket#::

  #!/bin/bash
  while read bdf rest; do
      # Find the config space offset of the NVIDIA DVSEC (vendor 10de, id 0x0004).
      dvsec4_reg=$(lspci -vv -s $bdf | awk '
          /Designated Vendor-Specific: Vendor=10de ID=0004/ {
              match($0, /\[([0-9a-fA-F]+)/, arr);
              print "0x" arr[1];
              exit
          }
      ')
      if [ -n "$dvsec4_reg" ]; then
          # Read the mapping bytes relative to the DVSEC offset.
          bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
          segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
          rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
          rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
          socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
          echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
      fi
  done < <(lspci -d 10de:)

Example output::

  0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
  0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
  0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
  0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
  0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
  0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
  0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
  0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
  0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
  0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
  0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
  0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
  000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
  000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
  000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
  000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
  000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
  000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
  000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
  000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
  000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01

PCIE-TGT PMU
------------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in the Tegra410 SoC
can have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The
PMU provides an RP filter to count PCIE BAR traffic to each RP and an address
filter to count accesses to PCIE BAR or CXL HDM ranges. The details of the
filters are described below.

Mapping the RC# to the lspci segment number works the same way as for the PCIE
PMU. Please see :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.

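
The PMU instances present on a system, and the events and filter fields each
instance exposes, can be listed directly from sysfs before building a perf
command line. A minimal sketch, assuming a socket-0 / RC-0 instance exists::

  # List the NVIDIA uncore PMU instances and inspect one of them.
  ls /sys/bus/event_source/devices/ | grep '^nvidia_'
  ls /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_0_rc_0/events/
  ls /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_0_rc_0/format/
  cat /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_0_rc_0/cpumask
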
The events in this PMU can be used to measure bandwidth and utilization:

 * rd_req: count the number of read requests to PCIE.
 * wr_req: count the number of write requests to PCIE.
 * rd_bytes: count the number of bytes transferred by rd_req.
 * wr_bytes: count the number of bytes transferred by wr_req.
 * cycles: count the clock cycles of the SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

  AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
  AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

  AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
  AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
The address filter works for both PCIE BAR and CXL HDM ranges. These filters
can be found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
  corresponds to all root ports (0 to 7) in the PCIE RC. Note that this filter is
  only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable the BAR or CXL HDM address range filter. If this is set, the
  address range specified by "dst_addr_base" and "dst_addr_mask" is used to filter
  the PCIE BAR and CXL HDM traffic addresses. The PMU uses the following comparison
  to determine if the traffic destination address falls within the filter range::

    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted. A sketch of deriving
  a base/mask pair for a given window follows after this list.

If the destination filter is not specified, the RP filter will be configured by
default to count PCIE BAR traffic to all root ports.

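
A base/mask pair for the address filter can be derived with shell arithmetic. A
minimal sketch for a naturally aligned, power-of-two sized window (the alignment
and size constraints, and the 20-bit width used to truncate the mask, are
assumptions of this sketch)::

  # Example: a 256-byte window at 0x10000.
  base=0x10000
  size=0x100
  mask=$(( ~(size - 1) & 0xFFFFF ))
  printf 'dst_addr_base=0x%x,dst_addr_mask=0x%x\n' "$base" "$mask"
  # Prints: dst_addr_base=0x10000,dst_addr_mask=0xfff00
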
Example usage:

* Count event id 0x0 to root ports 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to the PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/

CPU Memory (CMEM) Latency PMU
-----------------------------

This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:

 * RD_REQ counters: count read requests (32B per request).
 * RD_CUM_OUTS counters: accumulated outstanding request counters, which track
   how many cycles the read requests are in flight.
 * CYCLES counter: counts the number of elapsed cycles.

The average latency is calculated as::

  FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
  AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
  AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.

Example usage::

  perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'

NVLink-C2C PMU
--------------

This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.

The list of events:

 * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
 * IN_RD_REQ: the number of incoming read requests.
 * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
 * IN_WR_REQ: the number of incoming write requests.
 * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
 * OUT_RD_REQ: the number of outgoing read requests.
 * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
 * OUT_WR_REQ: the number of outgoing write requests.
 * CYCLES: NVLink-C2C interface cycle counts.

The incoming events count the reads/writes from the remote device to the SoC.
The outgoing events count the reads/writes from the SoC to the remote device.

The sysfs attribute /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains information about the connected device.

When the C2C interface is connected to GPU(s), the "gpu_mask" parameter can be
used to filter traffic to/from specific GPU(s). Each bit represents the GPU
index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" to GPUs 0
and 1. If "gpu_mask" is not specified, the PMU monitors all GPUs by default.

When connected to another SoC, only the read events are available.

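
Before choosing a "gpu_mask" value, the peer attribute described above can be
read to see what the interface is connected to. A minimal check for the
socket-0 instance::

  cat /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_0/peer
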
The events can be used to calculate the average latency of the read/write requests::

  C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

  IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
  IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

  OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
  OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

  OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
  OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

Example usage:

* Count incoming traffic from all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

* Count incoming traffic from GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

* Count incoming traffic from GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

* Count outgoing traffic to all GPUs connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

* Count outgoing traffic to GPU 0 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

* Count outgoing traffic to GPU 1 connected via NVLink-C2C::

    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/

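
Building on the examples above, the incoming read latency counters can be broken
out per GPU by sweeping "gpu_mask". A minimal sketch, assuming GPUs 0 and 1 are
connected to socket 0; the cycles event is left unfiltered since it counts the
interface clock rather than per-GPU traffic::

  # Collect per-GPU incoming read latency counters over a 5-second window.
  for gpu in 0 1; do
      mask=$(printf '0x%x' $(( 1 << gpu )))
      echo "GPU $gpu (gpu_mask=$mask):"
      perf stat -a \
          -e "{nvidia_nvlink_c2c_pmu_0/in_rd_req,gpu_mask=$mask/,nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=$mask/,nvidia_nvlink_c2c_pmu_0/cycles/}" \
          -- sleep 5
  done
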
NV-CLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-CLINK interface. Bandwidth events are not available in this PMU.
In the Tegra410 SoC, the NV-CLink interface is used to connect to another
Tegra410 SoC, and this PMU only counts read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.

The list of events:

 * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
 * IN_RD_REQ: the number of incoming read requests.
 * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
 * OUT_RD_REQ: the number of outgoing read requests.
 * CYCLES: NV-CLINK interface cycle counts.

The incoming events count the reads from the remote device to the SoC.
The outgoing events count the reads from the SoC to the remote device.

The events can be used to calculate the average latency of the read requests::

  CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

  OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
  OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

Example usage:

* Count incoming read traffic from the remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

* Count outgoing read traffic to the remote SoC connected via NV-CLINK::

    perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/

NV-DLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-DLINK interface. Bandwidth events are not available in this PMU.
In the Tegra410 SoC, this PMU only counts CXL memory read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.

The list of events:

 * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
 * IN_RD_REQ: the number of read requests to CXL memory.
 * CYCLES: NV-DLINK interface cycle counts.

The events can be used to calculate the average latency of the read requests::

  DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

* Count read events to CXL memory::

    perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'