NVIDIA DCGM Exporter Metrics

This page summarizes the metrics exposed by NVIDIA DCGM Exporter.

1. Metric List

1.1. Utilization Metrics

Metrics that report GPU utilization.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | Overall GPU utilization | Gauge | Percentage (0 ~ 100) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Percentage of time during the sampling period in which GPU memory was being read or written | Gauge | Percentage (0 ~ 100) |
| DCGM_FI_DEV_ENC_UTIL | Encoder utilization | Gauge | Percentage (0 ~ 100) |
| DCGM_FI_DEV_DEC_UTIL | Decoder utilization | Gauge | Percentage (0 ~ 100) |
[Table 1] NVIDIA DCGM Exporter Utilization Metrics

1.2. Memory Metrics

Metrics that report GPU memory (framebuffer) usage.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_FB_FREE | Free GPU memory | Gauge | MiB |
| DCGM_FI_DEV_FB_USED | GPU memory in use | Gauge | MiB |
[Table 2] NVIDIA DCGM Exporter Memory Metrics
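
The two memory gauges can be combined into a usage percentage. The following is a minimal sketch: the function name `fb_used_percent` is hypothetical, and it assumes that used + free approximates the total framebuffer size, which may not hold exactly if the driver reserves part of the memory.

```python
def fb_used_percent(fb_used: float, fb_free: float) -> float:
    """Return GPU memory usage as a percentage of (used + free).

    Both arguments are values of DCGM_FI_DEV_FB_USED and
    DCGM_FI_DEV_FB_FREE for the same GPU, in the same unit.
    Assumption: used + free approximates total memory.
    """
    total = fb_used + fb_free
    if total <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * fb_used / total


# Example: 4096 used and 12288 free gives 25.0 percent.
print(fb_used_percent(4096, 12288))
```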

1.3. Clock Metrics

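Metrics that report the clock frequencies of the GPU and GPU memory.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_SM_CLOCK | SM (Streaming Multiprocessor) clock frequency | Gauge | MHz |
| DCGM_FI_DEV_MEM_CLOCK | GPU memory clock frequency | Gauge | MHz |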
[Table 3] NVIDIA DCGM Exporter Clock Metrics

1.4. NVLink Metrics

Metrics related to NVLink.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL | Total number of flow-control CRC errors on NVLink | Counter | |
| DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL | Total number of data CRC errors on NVLink | Counter | |
| DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL | Total number of replays (retries) on NVLink | Counter | |
| DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL | Total number of recoveries on NVLink | Counter | |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total NVLink bandwidth counter across all lanes | Counter | |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 | Amount of data sent and received over the active NVLink, including header and payload | Counter | Bytes |
[Table 4] NVIDIA DCGM Exporter NVLink Metrics
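
Counter-type metrics such as the NVLink error counts are usually read as an increase over a time window rather than as raw values. The sketch below shows one way to compute that increase from a list of scraped samples. The function name `counter_increase` is hypothetical, and treating any drop in value as a counter reset is an assumption modelled on how Prometheus handles counters, without its extrapolation.

```python
def counter_increase(samples):
    """Sum the increases between consecutive counter samples.

    A sample lower than its predecessor is treated as a counter reset
    (for example after a driver reload), so the new value itself is
    counted as the increase since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Example: 10 -> 15 (+5), reset to 3 (+3), 3 -> 8 (+5), total 13.0
print(counter_increase([10, 15, 3, 8]))
```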

1.5. PCIe Metrics

Metrics related to PCIe.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Number of PCIe replays (retries) caused by packet transmission errors | Counter | |
[Table 5] NVIDIA DCGM Exporter PCIe Metrics

1.6. DCP (Profiling) Metrics

Profiling metrics.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time during which the graphics/compute engine was active over the sampling period | Gauge | Ratio (0 ~ 1) |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Ratio of time during which the Tensor Cores were active over the sampling period | Gauge | Ratio (0 ~ 1) |
| DCGM_FI_PROF_DRAM_ACTIVE | Ratio of time during which the GPU memory interface was sending or receiving data over the sampling period | Gauge | Ratio (0 ~ 1) |
| DCGM_FI_PROF_PCIE_RX_BYTES | Amount of data the GPU receives over PCIe, including header and payload | Gauge | Bytes per second |
| DCGM_FI_PROF_PCIE_TX_BYTES | Amount of data the GPU sends over PCIe, including header and payload | Gauge | Bytes per second |
[Table 6] NVIDIA DCGM Exporter DCP (Profiling) Metrics

1.7. Remapping Rows Metrics

Metrics for row remapping in GPU memory. Row remapping replaces faulty rows in GPU memory with spare, healthy rows. A row here is a row of memory cells inside a memory chip.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Number of rows remapped in GPU memory due to correctable errors | Counter | |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Number of rows remapped in GPU memory due to uncorrectable errors | Counter | |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Whether row remapping has failed in GPU memory | Gauge | |
[Table 7] NVIDIA DCGM Exporter Remapping Rows Metrics

1.8. ECC Metrics

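ECC (Error-Correcting Code) metrics for GPU memory. Volatile counts cover errors since the driver was last loaded, while aggregate counts persist across reloads.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Total number of volatile single-bit errors | Counter | |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Total number of volatile double-bit errors | Counter | |
| DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | Total number of aggregate (persistent) single-bit errors | Counter | |
| DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | Total number of aggregate (persistent) double-bit errors | Counter | |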
[Table 8] NVIDIA DCGM Exporter ECC Metrics

1.9. Retired Pages Metrics

Retired page metrics for GPU memory. Page retirement takes faulty pages out of use to improve GPU memory stability.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_RETIRED_SBE | Number of pages retired due to single-bit errors | Counter | |
| DCGM_FI_DEV_RETIRED_DBE | Number of pages retired due to double-bit errors | Counter | |
| DCGM_FI_DEV_RETIRED_PENDING | Number of pages pending retirement | Counter | |
[Table 9] NVIDIA DCGM Exporter Retired Pages Metrics

1.10. Error and Violation Metrics

Metrics for errors and throttling violations.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_XID_ERRORS | Code of the most recent XID error | Gauge | |
| DCGM_FI_DEV_POWER_VIOLATION | Throttling time due to the power limit | Counter | µs |
| DCGM_FI_DEV_THERMAL_VIOLATION | Throttling time due to the temperature limit | Counter | µs |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Throttling time due to the Sync-Boost limit | Counter | µs |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Throttling time due to the board limit | Counter | µs |
| DCGM_FI_DEV_LOW_UTIL_VIOLATION | Throttling time due to low utilization | Counter | µs |
| DCGM_FI_DEV_RELIABILITY_VIOLATION | Throttling time due to the reliability limit | Counter | µs |
[Table 10] NVIDIA DCGM Exporter Error and Violation Metrics
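
Because the violation counters accumulate throttling time in microseconds, the difference between two samples divided by the elapsed wall-clock time gives the fraction of that interval spent throttled. The following is a rough sketch. The function name `throttled_fraction` and its parameters are hypothetical, and it assumes both samples come from the same GPU with no counter reset in between.

```python
def throttled_fraction(start_us: float, end_us: float, interval_s: float) -> float:
    """Fraction of the interval (0 ~ 1) during which the GPU was throttled.

    start_us / end_us: two samples of a *_VIOLATION counter, in microseconds.
    interval_s: wall-clock seconds between the two samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_us = end_us - start_us
    if delta_us < 0:
        raise ValueError("counter decreased; possible reset between samples")
    return delta_us / (interval_s * 1_000_000)


# Example: 3 s of throttling accumulated over a 60 s window
fraction = throttled_fraction(2_000_000, 5_000_000, 60)
print(f"{fraction:.2%}")
```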

1.11. Power Metrics

Metrics for power consumption.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_POWER_USAGE | GPU power consumption | Gauge | Watt |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Total energy consumed since the GPU driver started | Counter | mJ |
[Table 11] NVIDIA DCGM Exporter Power Metrics
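
The energy counter is reported in millijoules, so two samples can be converted into an average power draw and an energy figure in kWh. This is a minimal sketch under the assumption that the counter did not reset between the samples. The function name `energy_summary` is hypothetical.

```python
def energy_summary(start_mj: float, end_mj: float, interval_s: float):
    """Return (average watts, kWh) between two energy counter samples.

    start_mj / end_mj: samples of DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION in mJ.
    interval_s: wall-clock seconds between the two samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_mj = end_mj - start_mj
    if delta_mj < 0:
        raise ValueError("counter decreased; possible reset between samples")
    joules = delta_mj / 1000.0       # 1 J = 1000 mJ
    avg_watts = joules / interval_s  # 1 W = 1 J/s
    kwh = joules / 3_600_000.0       # 1 kWh = 3.6e6 J
    return avg_watts, kwh


# Example: 18,000,000 mJ consumed over 60 s is an average of 300 W
avg_watts, kwh = energy_summary(1_000_000, 19_000_000, 60)
print(avg_watts, kwh)
```

The average derived this way can be compared against DCGM_FI_DEV_POWER_USAGE, which reports a point-in-time value instead.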

1.12. Temperature Metrics

Metrics for temperature.

| Metric | Description | Metric Type | Value Unit |
|---|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | GPU temperature | Gauge | Celsius |
| DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | Gauge | Celsius |
[Table 12] NVIDIA DCGM Exporter Temperature Metrics

1.13. License Metrics

Metrics for license status.

| Metric | Description | Metric Type | Value |
|---|---|---|---|
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | Status of the license required to use vGPU functionality | Gauge | 0 : no vGPU license is present, 1 : a vGPU license is present |
[Table 13] NVIDIA DCGM Exporter License Metrics

2. Metric Label

| Label | Description | Value | Note |
|---|---|---|---|
| UUID | GPU's UUID | GPU-{UUID} | |
| gpu | GPU number | 0, 1, 2, … | |
| device | GPU device number | nvidia0, nvidia1, … | |
| Hostname | Name of the host where the GPU is installed | | |
| modelName | GPU model name | | |
| pci_bus_id | GPU's PCI bus ID | | |
| exported_namespace | Namespace of the Pod using the GPU | | K8s Pod Metric |
| exported_pod | Name of the Pod using the GPU | | K8s Pod Metric |
| exported_container | Name of the container using the GPU | | K8s Pod Metric |
| DCGM_FI_DRIVER_VERSION | GPU driver version | | MIG Metric |
| GPU_I_PROFILE | MIG profile name | | MIG Metric |
| GPU_I_ID | MIG instance ID | | MIG Metric |
[Table 14] NVIDIA DCGM Exporter Metric Label
  • Note
    • K8s Pod Metric : Label attached only when the GPU is allocated to a K8s Pod
    • MIG Metric : Label attached only to a GPU Instance created with MIG (Multi-Instance GPU) functionality
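
To illustrate how these labels appear on a scraped sample, the sketch below parses a single line of Prometheus text exposition output into a metric name, a label dictionary, and a value. The sample line, its label values (including the all-zero UUID and the host name `node-1`), and the helper name `parse_sample` are hypothetical. The parser assumes a line with a label set and no trailing timestamp, and it does not unescape label values.

```python
import re

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# label="value", allowing backslash-escaped characters inside the value
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one exposition line into (name, labels dict, float value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample line using labels from the table above
sample = (
    'DCGM_FI_DEV_GPU_UTIL{gpu="0",'
    'UUID="GPU-00000000-0000-0000-0000-000000000000",'
    'device="nvidia0",Hostname="node-1"} 42'
)
name, labels, value = parse_sample(sample)
print(name, labels["gpu"], labels["device"], value)
```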