=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
* NVLink-C2C
* NV-CLink
* NV-DLink

PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. Please see the sections below to get the sysfs path of each PMU. Like
other uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show
the CPU id used to handle the PMU event. There is also an "associated_cpus"
sysfs attribute, which contains a list of CPUs associated with the PMU instance.

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last-level cache for CPU memory and CXL memory, and as a
cache-coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

  * CPU clusters
  * GPU
  * PCIe Ordering Controller Unit (OCU)
  * Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

  * slc_access_rd: count the number of read requests to SLC.
  * slc_access_wr: count the number of write requests to SLC.
  * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
  * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
  * mem_access_rd: count the number of read requests to local or remote memory.
  * mem_access_wr: count the number of write requests to local or remote memory.
  * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
  * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
  * cycles: count the UCF cycles.

The average bandwidth is calculated as::

   AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
   AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
   AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
   AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
   AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
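
The bandwidth and request rate formulas can be evaluated with a small helper
once the counter values are known. The following is a minimal sketch; the
function names ``avg_bandwidth_gbps`` and ``avg_request_rate`` and the counter
values are made up for illustration and are not part of the driver:

```shell
#!/bin/bash
# Illustrative helpers only; names and sample numbers are assumptions.

# bytes / elapsed ns gives GB/s (1 byte per ns is 10^9 bytes per second).
avg_bandwidth_gbps() {
  awk -v b="$1" -v t="$2" 'BEGIN { printf "%.3f\n", b / t }'
}

# requests / cycles gives the average number of requests per UCF cycle.
avg_request_rate() {
  awk -v r="$1" -v c="$2" 'BEGIN { printf "%.3f\n", r / c }'
}

# Made-up counter values for a 1 second (10^9 ns) measurement window.
avg_bandwidth_gbps 64000000000 1000000000   # slc_bytes_rd, elapsed ns
avg_request_rate 500000000 2000000000       # slc_access_rd, cycles
```

With these sample values the helpers print 64.000 (GB/s) and 0.250 (requests
per cycle). The same arithmetic applies to the write and memory variants.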

More details about what other events are available can be found in the Tegra410
SoC technical reference manual.

The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
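
When many filter combinations are measured, the event string can be assembled
by a script. The helper below is a sketch under that assumption; ``ucf_event``
is a hypothetical name, and the filter names are the ones listed above:

```shell
#!/bin/bash
# ucf_event <socket-id> <event-id> [filter ...]
# Hypothetical helper: builds a UCF PMU event string with each named
# filter set to 0x1.
ucf_event() {
  local socket=$1 event=$2 filter spec
  shift 2
  spec="nvidia_ucf_pmu_${socket}/event=${event}"
  for filter in "$@"; do
    spec="${spec},${filter}=0x1"
  done
  printf '%s/\n' "$spec"
}

ucf_event 0 0x0 src_loc_cpu dst_loc_cmem
```

This prints ``nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/``,
which can be passed to perf, e.g.
``perf stat -a -e "$(ucf_event 0 0x0 src_loc_cpu dst_loc_cmem)" sleep 1``.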

PCIE PMU
--------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors all read/write traffic from the root port(s)
or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per
PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
up to 8 root ports. The traffic from each root port can be filtered using the RP
or BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter
will capture traffic from all RPs. Please see below for more details.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth, utilization, and
latency:

  * rd_req: count the number of read requests by PCIE device.
  * wr_req: count the number of write requests by PCIE device.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * rd_cum_outs: count outstanding rd_req each cycle.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
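
The three latency steps can be chained in one helper. This is only a sketch:
``avg_latency_ns`` is a hypothetical name and the counter values are made up to
keep the arithmetic easy to follow:

```shell
#!/bin/bash
# avg_latency_ns <rd_cum_outs> <rd_req> <cycles> <elapsed_ns>
# Hypothetical helper that applies the latency formulas step by step.
avg_latency_ns() {
  awk -v outs="$1" -v req="$2" -v cyc="$3" -v ns="$4" 'BEGIN {
    freq_ghz = cyc / ns        # FREQ_IN_GHZ
    lat_cycles = outs / req    # AVG_LATENCY_IN_CYCLES
    printf "%.1f\n", lat_cycles / freq_ghz
  }'
}

# Made-up values: 600 cycles per request on a 2 GHz fabric clock.
avg_latency_ns 600000000 1000000 2000000000 1000000000
```

With these sample values the helper prints 300.0, i.e. 600 cycles at 2 GHz
correspond to 300 ns.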

The PMU events can be filtered based on the traffic source and destination.
The source filter indicates the PCIE devices that will be monitored. The
destination filter specifies the destination memory type, e.g. local system
memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
classification of the destination filter is based on the home socket of the
address, not where the data actually resides. These filters can be found in
/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

The list of event filters:

* Source filter:

  * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
    bitmask represents the RP index in the RC. If the bit is set, all devices under
    the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
    devices in root port 0 to 3.
  * src_bdf: the BDF that will be monitored. This is a 16-bit value that
    follows the formula: (bus << 8) + (device << 3) + (function). For example,
    the value of BDF 27:01.1 is 0x2709.
  * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
    "src_bdf" is used to filter the traffic.

  Note that Root-Port and BDF filters are mutually exclusive and the PMU in
  each RC can only have one BDF filter for all counters. If the BDF filter
  is enabled, the BDF filter value will be applied to all events.

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
  * dst_loc_pcie_cxl: if set, count events to local CXL memory address
  * dst_rem: if set, count events to remote memory address

If the source filter is not specified, the PMU will count events from all root
ports. If the destination filter is not specified, the PMU will count events
to all destinations.
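
The source filter values can be derived with shell arithmetic. The sketch below
assumes hypothetical helper names (``rp_mask``, ``bdf_value``) and takes the BDF
fields as hexadecimal, the way lspci prints them; it applies the
(bus << 8) + (device << 3) + (function) formula given for "src_bdf":

```shell
#!/bin/bash
# rp_mask <rp-index> ... : set one bit per root port index.
rp_mask() {
  local mask=0 rp
  for rp in "$@"; do
    mask=$(( mask | (1 << rp) ))
  done
  printf '0x%x\n' "$mask"
}

# bdf_value <bus>:<device>.<function> : fields are hex, as shown by lspci.
bdf_value() {
  local bus=${1%%:*} rest=${1#*:}
  local dev=${rest%%.*} fn=${rest#*.}
  printf '0x%04x\n' $(( (0x$bus << 8) + (0x$dev << 3) + 0x$fn ))
}

rp_mask 0 1 2 3      # root ports 0 to 3
bdf_value 3b:02.1    # bus 0x3b, device 0x02, function 1
```

The two calls print 0xf and 0x3b11. Since the RP and BDF filters are mutually
exclusive, only one of the two values would be used for a given PMU.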

Example usage:

* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/

* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
  targeting just local CMEM of socket 0::

    perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/

* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
  targeting just local CMEM of socket 1::

    perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/

* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
  destinations::

    perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/

.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:

Mapping the RC# to lspci segment number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
Designated Vendor Specific Capability (DVSEC) register is added into the PCIE
config space for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4".
The DVSEC register contains the following information to map PCIE devices under
the RP back to their RC#:

  - Bus# (byte 0xc) : bus number as reported by the lspci output
  - Segment# (byte 0xd) : segment number as reported by the lspci output
  - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
  - RC# (byte 0xf): root complex number associated with the RP
  - Socket# (byte 0x10): socket number associated with the RP

Example script for mapping lspci BDF to RC# and socket#::

  #!/bin/bash
  # Note: the three-argument match() below is a gawk extension.
  while read bdf rest; do
    # Find the config space offset of the NVIDIA DVSEC with id 0x4.
    dvsec4_reg=$(lspci -vv -s $bdf | awk '
      /Designated Vendor-Specific: Vendor=10de ID=0004/ {
        match($0, /\[([0-9a-fA-F]+)/, arr);
        print "0x" arr[1];
        exit
      }
    ')
    if [ -n "$dvsec4_reg" ]; then
      # Read one byte per field at the offsets listed above.
      bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
      segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
      rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
      rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
      socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
      echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
    fi
  done < <(lspci -d 10de:)

Example output::

  0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
  0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
  0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
  0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
  0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
  0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
  0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
  0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
  0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
  0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
  0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
  0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
  000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
  000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
  000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
  000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
  000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
  000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
  000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
  000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
  000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
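
Lines in the output format shown above can be turned into PMU device names. The
following sketch assumes a hypothetical helper name, ``pmu_names``, and assumes
that the socket and RC ids in the device name are the decimal form of the hex
bytes printed by the mapping script:

```shell
#!/bin/bash
# pmu_names: read mapping lines on stdin and print, for each root port,
# its BDF followed by the PCIE PMU device name derived from RC and Socket.
pmu_names() {
  sed -n 's/^\([^ ]*\): .*RC=\([0-9a-fA-F]*\), Socket=\([0-9a-fA-F]*\)$/\1 \2 \3/p' |
  while read -r bdf rc socket; do
    printf '%s nvidia_pcie_pmu_%d_rc_%d\n' "$bdf" "$((0x$socket))" "$((0x$rc))"
  done
}

echo "0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00" | pmu_names
```

For the sample line this prints ``0002:a0:00.0 nvidia_pcie_pmu_0_rc_1``. Lines
that do not match the expected format are skipped.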

PCIE-TGT PMU
------------

This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in Tegra410 SoC can
have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The PMU
provides an RP filter to count PCIE BAR traffic to each RP and an address filter
to count access to PCIE BAR or CXL HDM ranges. The details of the filters are
described in the following sections.

Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see
:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.

The events in this PMU can be used to measure bandwidth and utilization:

  * rd_req: count the number of read requests to PCIE.
  * wr_req: count the number of write requests to PCIE.
  * rd_bytes: count the number of bytes transferred by rd_req.
  * wr_bytes: count the number of bytes transferred by wr_req.
  * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.

The average bandwidth is calculated as::

   AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
   AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
   AVG_WR_REQUEST_RATE = WR_REQ / CYCLES

The PMU events can be filtered based on the destination root port or target
address range. Filtering based on RP is only available for PCIE BAR traffic.
The address filter works for both PCIE BAR and CXL HDM ranges. These filters can
be found in sysfs, see
/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.

Destination filter settings:

* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
  corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
  only available for PCIE BAR traffic.
* dst_addr_base: BAR or CXL HDM filter base address.
* dst_addr_mask: BAR or CXL HDM filter address mask.
* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
  address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
  the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
  to determine if the traffic destination address falls within the filter range::

    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)

  If the comparison succeeds, then the event will be counted.

If the destination filter is not specified, the RP filter will be configured by default
to count PCIE BAR traffic to all root ports.
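
The address comparison can be tried out in software before programming the
filter. The sketch below mirrors the comparison with shell arithmetic;
``addr_matches`` is a hypothetical helper name and the addresses are sample
values chosen for illustration:

```shell
#!/bin/bash
# addr_matches <addr> <dst_addr_base> <dst_addr_mask>
# Returns success when (addr & mask) == (base & mask).
addr_matches() {
  [ $(( $1 & $3 )) -eq $(( $2 & $3 )) ]
}

# Check three sample addresses against base 0x10000 and mask 0xFFF00.
for addr in 0x10000 0x100ff 0x10100; do
  if addr_matches "$addr" 0x10000 0xFFF00; then
    echo "$addr counted"
  else
    echo "$addr not counted"
  fi
done
```

The first two addresses pass the comparison and the third does not, because the
mask ignores only the low 8 address bits.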

Example usage:

* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/

* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
  0x10000 to 0x100FF on socket 0's PCIE RC-1::

    perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/

CPU Memory (CMEM) Latency PMU
-----------------------------

This PMU monitors latency events of memory read requests from the edge of the
Unified Coherence Fabric (UCF) to local CPU DRAM:

  * RD_REQ counters: count read requests (32B per request).
  * RD_CUM_OUTS counters: accumulated outstanding request counter, which tracks
    how many cycles the read requests are in flight.
  * CYCLES counter: counts the number of elapsed cycles.

The average latency is calculated as::

   FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
   AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
   AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.

Example usage::

  perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
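
The counter values can be post-processed from the CSV output of
``perf stat -x,``, where the first field of each line is the counter value and
the third field is the event name. The sketch below assumes a hypothetical
helper name, ``cmem_latency_ns``, takes the elapsed time in ns as an argument,
and is fed made-up sample lines rather than real measurements:

```shell
#!/bin/bash
# cmem_latency_ns <elapsed_ns>: read "perf stat -x," lines on stdin and
# print the average read latency in ns, or "n/a" if a counter is missing.
cmem_latency_ns() {
  awk -F, -v ns="$1" '
    $3 ~ /rd_req/      { req = $1 }
    $3 ~ /rd_cum_outs/ { outs = $1 }
    $3 ~ /cycles/      { cyc = $1 }
    END {
      if (req == 0 || cyc == 0) { print "n/a"; exit }
      printf "%.1f\n", (outs / req) / (cyc / ns)
    }'
}

# Made-up sample lines in the CSV layout, for a 1 second window.
printf '%s\n' \
  '2000000,,nvidia_cmem_latency_pmu_0/rd_req/,1000000000,100.00,,' \
  '500000000,,nvidia_cmem_latency_pmu_0/rd_cum_outs/,1000000000,100.00,,' \
  '1500000000,,nvidia_cmem_latency_pmu_0/cycles/,1000000000,100.00,,' |
  cmem_latency_ns 1000000000
```

The sample lines give 250 cycles per request at 1.5 GHz, so the helper prints
166.7. On a real system, perf stat writes its results to stderr by default, so
the output would need to be redirected (for example with ``2>&1``) before it is
piped into the helper.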

NVLink-C2C PMU
--------------

This PMU monitors latency events of memory read/write requests that pass through
the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
  * IN_RD_REQ: the number of incoming read requests.
  * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
  * IN_WR_REQ: the number of incoming write requests.
  * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
  * OUT_RD_REQ: the number of outgoing read requests.
  * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
  * OUT_WR_REQ: the number of outgoing write requests.
  * CYCLES: NVLink-C2C interface cycle counts.

The incoming events count the reads/writes from the remote device to the SoC.
The outgoing events count the reads/writes from the SoC to the remote device.

The sysfs file /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
contains the information about the connected device.

When the C2C interface is connected to GPU(s), the user can use the
"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU
index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1.
If "gpu_mask" is not specified, the PMU monitors all GPUs by default.
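
A mask value can be decoded back into GPU indices to double-check which GPUs a
given "gpu_mask" selects. The helper below is an illustrative sketch;
``gpu_indices`` is a hypothetical name:

```shell
#!/bin/bash
# gpu_indices <gpu_mask>: print the GPU indices whose bits are set.
gpu_indices() {
  local mask=$(( $1 )) idx=0 out=""
  while [ "$mask" -ne 0 ]; do
    if [ $(( mask & 1 )) -eq 1 ]; then
      out="${out:+$out }$idx"
    fi
    mask=$(( mask >> 1 ))
    idx=$(( idx + 1 ))
  done
  echo "$out"
}

gpu_indices 0x3
```

For "gpu_mask=0x3" this prints ``0 1``, matching the description above.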

When connected to another SoC, only the read events are available.

The events can be used to calculate the average latency of the read/write requests::

   C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
   IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

   OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
   OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ

Example usage:

  * Count incoming traffic from all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/

  * Count incoming traffic from GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/

  * Count incoming traffic from GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/

  * Count outgoing traffic to all GPUs connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/

  * Count outgoing traffic to GPU 0 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/

  * Count outgoing traffic to GPU 1 connected via NVLink-C2C::

      perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/

NV-CLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-CLink interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410
SoC and this PMU only counts read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
  * IN_RD_REQ: the number of incoming read requests.
  * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
  * OUT_RD_REQ: the number of outgoing read requests.
  * CYCLES: NV-CLink interface cycle counts.

The incoming events count the reads from the remote device to the SoC.
The outgoing events count the reads from the SoC to the remote device.

The events can be used to calculate the average latency of the read requests::

   CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

   OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
   OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ

Example usage:

  * Count incoming read traffic from remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/

  * Count outgoing read traffic to remote SoC connected via NV-CLink::

      perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/

NV-DLink PMU
------------

This PMU monitors latency events of memory read requests that pass through
the NV-DLink interface. Bandwidth events are not available in this PMU.
In Tegra410 SoC, this PMU only counts CXL memory read traffic.

The events and configuration options of this PMU device are available in sysfs,
see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.

The list of events:

  * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
  * IN_RD_REQ: the number of read requests to CXL memory.
  * CYCLES: NV-DLink interface cycle counts.

The events can be used to calculate the average latency of the read requests::

   DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS

   IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
   IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ

Example usage:

  * Count read events to CXL memory::

      perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'