The only agent that thinks for itself

Autonomous Monitoring with self-learning AI built-in, operating independently across your entire stack.

Unlimited Metrics & Logs
Machine learning & MCP
5% CPU, 150MB RAM
3GB disk, >1 year retention
800+ integrations, zero config
Dashboards, alerts out of the box
> Discover Netdata Agents

Centralized metrics streaming and storage

Aggregate metrics from multiple agents into centralized Parent nodes for unified monitoring across your infrastructure.

Stream from unlimited agents
Long-term data retention
High availability clustering
Data replication & backup
Scalable architecture
Enterprise-grade security
> Learn about Parents

Fully managed cloud platform

Access your monitoring data from anywhere with our SaaS platform. No infrastructure to manage, automatic updates, and global availability.

Zero infrastructure management
99.9% uptime SLA
Global data centers
Automatic updates & patches
Enterprise SSO & RBAC
SOC2 & ISO certified
> Explore Netdata Cloud

Deploy Netdata Cloud in your infrastructure

Run the full Netdata Cloud platform on-premises for complete data sovereignty and compliance with your security policies.

Complete data sovereignty
Air-gapped deployment
Custom compliance controls
Private network integration
Dedicated support team
Kubernetes & Docker support
> Learn about Cloud On-Premises

Powerful, intuitive monitoring interface

Modern, responsive UI built for real-time troubleshooting with customizable dashboards and advanced visualization capabilities.

Real-time chart updates
Customizable dashboards
Dark & light themes
Advanced filtering & search
Responsive on all devices
Collaboration features
> Explore Netdata UI

Monitor on the go

Native iOS and Android apps bring full monitoring capabilities to your mobile device with real-time alerts and notifications.

iOS & Android apps
Push notifications
Touch-optimized interface
Offline data access
Biometric authentication
Widget support
> Download apps

Best energy efficiency

True real-time per-second

100% automated zero config

Centralized observability

Multi-year retention

High availability built-in

Zero maintenance

Always up-to-date

Enterprise security

Complete data control

Air-gap ready

Compliance certified

Millisecond responsiveness

Infinite zoom & pan

Works on any device

Native performance

Instant alerts

Monitor anywhere

80% Faster Incident Resolution

AI-powered troubleshooting from detection, to root cause and blast radius identification, to reporting.

True Real-Time and Simple, even at Scale

Linearly and infinitely scalable full-stack observability that can be deployed even mid-crisis.

90% Cost Reduction, Full Fidelity

Instead of centralizing the data, Netdata distributes the code, eliminating pipelines and complexity.

Control Without Surrender

SOC 2 Type 2 certified with every metric kept on your infrastructure.

Integrations

800+ collectors and notification channels, auto-discovered and ready out of the box.

800+ data collectors
Auto-discovery & zero config
Cloud, infra, app protocols
Notifications out of the box
> Explore integrations
Real Results
46% Cost Reduction

Reduced monitoring costs by 46% while cutting staff overhead by 67%.

— Leonardo Antunez, Codyas

Zero Pipeline

No data shipping. No central storage costs. Query at the edge.

From Our Users
"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

No Query Language

Point-and-click troubleshooting. No PromQL, no LogQL, no learning curve.

Enterprise Ready
67% Less Staff, 46% Cost Cut

Enterprise efficiency without enterprise complexity—real ROI from day one.

— Leonardo Antunez, Codyas

SOC 2 Type 2 Certified

Zero data egress. Only metadata reaches the cloud. Your metrics stay on your infrastructure.

Full Coverage
800+ Collectors

Auto-discovered and configured. No manual setup required.

Any Notification Channel

Slack, PagerDuty, Teams, email, webhooks—all built-in.

Built for the People Who Get Paged

Because 3am alerts deserve instant answers, not hour-long hunts.

Every Industry Has Rules. We Master Them.

See how healthcare, finance, and government teams cut monitoring costs 90% while staying audit-ready.

Monitor Any Technology. Configure Nothing.

Install the agent. It already knows your stack.
From Our Users
"A Rare Unicorn"

Netdata gives more than you invest in it. A rare unicorn that obeys the Pareto rule.

— Eduard Porquet Mateu, TMB Barcelona

99% Downtime Reduction

Reduced website downtime by 99% and cloud bill by 30% using Netdata alerts.

— Falkland Islands Government

Real Savings
30% Cloud Cost Reduction

Optimized resource allocation based on Netdata alerts cut cloud spending by 30%.

— Falkland Islands Government

46% Cost Cut

Reduced monitoring staff by 67% while cutting operational costs by 46%.

— Codyas

Real Coverage
"Plugin for Everything"

Netdata has agent capacity or a plugin for everything, including Windows and Kubernetes.

— Eduard Porquet Mateu, TMB Barcelona

"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

Real Speed
Troubleshooting in 30 Seconds

From 2-3 minutes to 30 seconds—instant visibility into any node issue.

— Matthew Artist, Nodecraft

20% Downtime Reduction

20% less downtime and 40% budget optimization from out-of-the-box monitoring.

— Simon Beginn, LANCOM Systems

Pay per Node. Unlimited Everything Else.

One price per node. Unlimited metrics, logs, users, and retention. No per-GB surprises.

Free tier—forever
No metric limits or caps
Retention you control
Cancel anytime
> See pricing plans

What's Your Monitoring Really Costing You?

Most teams overpay by 40-60%. Let's find out why.

Expose hidden metric charges
Calculate tool consolidation
Customers report 30-67% savings
Results in under 60 seconds
> See what you're really paying

Your Infrastructure Is Unique. Let's Talk.

Because monitoring 10 nodes is different from monitoring 10,000.

On-prem & air-gapped deployment
Volume pricing & agreements
Architecture review for your scale
Compliance & security support
> Start a conversation

Monitoring That Sells Itself

Deploy in minutes. Impress clients in hours. Earn recurring revenue for years.

30-second live demos close deals
Zero config = zero support burden
Competitive margins & deal protection
Response in 48 hours
> Apply to partner

Per-Second Metrics at Homelab Prices

Same engine, same dashboards, same ML. Just priced for tinkerers.

Community: Free forever · 5 nodes · non-commercial
Homelab: $90/yr · unlimited nodes · fair usage
> Start monitoring your lab—free

$1,000 Per Referral. Unlimited Referrals.

Your colleagues get 10% off. You get 10% commission. Everyone wins.

10% of subscriptions, up to $1,000 each
Track earnings inside Netdata Cloud
PayPal/Venmo payouts in 3-4 weeks
No caps, no complexity
> Get your referral link
Cost Proof
40% Budget Optimization

"Netdata's significant positive impact" — LANCOM Systems

Calculate Your Savings

Compare vs Datadog, Grafana, Dynatrace

Savings Proof
46% Cost Reduction

"Cut costs by 46%, staff by 67%" — Codyas

30% Cloud Bill Savings

"Reduced cloud bill by 30%" — Falkland Islands Gov

Enterprise Proof
"Better Than Combined Alternatives"

"Better observability with Netdata than combining other tools." — TMB Barcelona

Real Engineers, <24h Response

DPA, SLAs, on-prem, volume pricing

Why Partners Win
Demo Live Infrastructure

One command, 30 seconds, real data—no sandbox needed

Zero Tickets, High Margins

Auto-config + per-node pricing = predictable profit

Homelab Ready
"Absolutely Incredible"

"We tested every monitoring system under the sun." — Benjamin Gabler, CEO Rocket.Net

76k+ GitHub Stars

3rd most starred monitoring project

Worth Recommending
Product That Delivers

Customers report 40-67% cost cuts, 99% downtime reduction

Zero Risk to Your Rep

Free tier lets them try before they buy

AI Support Assistant, Available 24/7

Nedi has access to all official documentation, source code, and resources. Ask any question about Netdata—responds in your language.

Deployment & configuration
Troubleshooting & sizing
Alerts & notifications
Evidence-based answers
> Ask Nedi now

Never Fight Fires Alone

Docs, community, and expert help—pick your path to resolution.

Learn.netdata.cloud docs
Discord, Forums, GitHub
Premium support available
> Get answers now

60 Seconds to First Dashboard

One command to install. Zero config. 850+ integrations documented.

Linux, Windows, K8s, Docker
Auto-discovers your stack
> Read our documentation

Level Up Your Monitoring

Real problems. Real solutions. 112+ guides from basic monitoring to AI observability.

76,000+ Engineers Strong

615+ contributors. 1.5M daily downloads. One mission: simplify observability.

Per-Second. 90% Cheaper. Data Stays Home.

Side-by-side comparisons: costs, real-time granularity, and data sovereignty for every major tool.

See why teams switch from Datadog, Prometheus, Grafana, and more.

> Browse all comparisons
Edge-Native Observability, Born Open Source
Per-second visibility, ML on every metric, and data that never leaves your infrastructure.
Founded in 2016
615+ contributors worldwide
Remote-first, engineering-driven
Open source first
> Read our story
Promises We Publish—and Prove
12 principles backed by open code, independent validation, and measurable outcomes.
Open source, peer-reviewed
Zero config, instant value
Data sovereignty by design
Aligned pricing, no surprises
> See all 12 principles
Edge-Native, AI-Ready, 100% Open
76k+ stars. Full ML, AI, and automation—GPLv3+, not premium add-ons.
76,000+ GitHub stars
GPLv3+ licensed forever
ML on every metric, included
Zero vendor lock-in
> Explore our open source
Build Real-Time Observability for the World
Remote-first team shipping per-second monitoring with ML on every metric.
Remote-first, fully distributed
Open source (76k+ stars)
Challenging technical problems
Your code on millions of systems
> See open roles
Talk to a Netdata Human in <24 Hours
Sales, partnerships, press, or professional services—real engineers, fast answers.
Discuss your observability needs
Pricing and volume discounts
Partnership opportunities
Media and press inquiries
> Book a conversation
Your Data. Your Rules.
On-prem data, cloud control plane, transparent terms.
Trust & Scale
76,000+ GitHub Stars

One of the most popular open-source monitoring projects

SOC 2 Type 2 Certified

Enterprise-grade security and compliance

Data Sovereignty

Your metrics stay on your infrastructure

Validated
University of Amsterdam

"Most energy-efficient monitoring solution" — ICSOC 2023, peer-reviewed

ADASTEC (Autonomous Driving)

"Doesn't miss alerts—mission-critical trust for safety software"

Community Stats
615+ Contributors

Global community improving monitoring for everyone

1.5M+ Downloads/Day

Trusted by teams worldwide

GPLv3+ Licensed

Free forever, fully open source agent

Why Join?
Remote-First

Work from anywhere, async-friendly culture

Impact at Scale

Your work helps millions of systems

Nvidia Data Center GPU Manager (DCGM)

Plugin: go.d.plugin Module: dcgm

Overview

This collector gathers NVIDIA GPU telemetry from a dcgm-exporter endpoint. It supports all numeric fields exposed by the exporter and maps them into Netdata-native contexts.

It collects metrics by periodically scraping the exporter Prometheus endpoint over HTTP.
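You can see exactly what the collector will scrape by querying the endpoint yourself. A quick sanity check, assuming the exporter listens on its default port (the sample output lines are illustrative):

curl -s http://127.0.0.1:9400/metrics | head
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxxxxxx"} 41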

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.
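In practice that means one job per exporter endpoint. A minimal sketch of a multi-job go.d/dcgm.conf (the remote hostname is a placeholder):

jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
  - name: gpu_node_2
    url: http://gpu-node-2.example.com:9400/metrics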

Nvidia Data Center GPU Manager (DCGM) can be monitored further using the following other integrations:

  • {% relatedResource id="go.d.plugin-nvidia_smi-Nvidia_GPU" %}Nvidia GPU{% /relatedResource %}

Default Behavior

Auto-Detection

This integration does not support auto-detection in v1.

Limits

The collector applies global and per-metric time series limits to prevent excessive cardinality.

Performance Impact

The impact depends on dcgm-exporter field selection and resulting series cardinality.

Setup

You can configure the dcgm collector in two ways:

| Method | Best for | How to |
|--------|----------|--------|
| UI | Fast setup without editing files | Go to Nodes → Configure this node → Collectors → Jobs, search for dcgm, then click + to add a job. |
| File | If you prefer configuring via file, or need to automate deployments (e.g., with Ansible) | Edit go.d/dcgm.conf and add a job. |

:::important

UI configuration requires a paid Netdata Cloud plan.

:::

Prerequisites

Run dcgm-exporter

Install DCGM and run dcgm-exporter so that a Prometheus endpoint is available (default :9400/metrics).
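As a sketch, one common way to do this is with NVIDIA's container image (the image tag is a placeholder; pick a current one from NVIDIA's registry):

# requires the NVIDIA container toolkit for --gpus
docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>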

Configure exporter field list

The default exporter profile exposes a small subset of fields. Use the Netdata recommended profile: dcgm-exporter-netdata.csv (raw download: https://raw.githubusercontent.com/netdata/netdata/master/src/go/plugin/go.d/collector/dcgm/dcgm-exporter-netdata.csv).

The Netdata profile enables 127 fields by default and documents all remaining known DCGM fields as commented entries. To customize beyond the baseline, uncomment the field you need and comment out one currently enabled field.
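For reference, each CSV line follows the standard dcgm-exporter format: DCGM field name, Prometheus metric type, help string. The entries below are illustrative and are not copied from the Netdata profile:

DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
# DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).   <- commented out, not collected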

Runtime validation artifacts: src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.md and src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.json

Validation is primarily version-scoped (NVIDIA driver + DCGM/DCGM-exporter versions), so treat it as a strong baseline rather than universal compatibility.

Example: dcgm-exporter -f /path/to/dcgm-exporter-netdata.csv

Keep collection intervals aligned

Set Netdata update_every to the same value as the dcgm-exporter collection interval (default 30 seconds). Example exporter interval: dcgm-exporter -c 30000 and Netdata update_every: 30.
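Putting the two sides together, a minimal sketch (exporter flag on one side, go.d/dcgm.conf job on the other):

# exporter: collect DCGM fields every 30000 ms
dcgm-exporter -c 30000

# Netdata job (go.d/dcgm.conf): scrape the exporter every 30 s
jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
    update_every: 30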

Enable profiling capabilities (optional)

Profiling fields may require additional privileges/capabilities in your runtime environment.

Configuration

Options

The following options can be defined globally: update_every, autodetection_retry.

| Group | Option | Description | Default | Required |
|-------|--------|-------------|---------|----------|
| Collection | update_every | Data collection interval (seconds). Keep this aligned with dcgm-exporter collection interval. | 30 | no |
| Collection | autodetection_retry | Autodetection retry interval (seconds). Set 0 to disable. | 0 | no |
| Target | url | DCGM exporter metrics endpoint URL. | http://127.0.0.1:9400/metrics | yes |
| Target | timeout | HTTP request timeout (seconds). | 10 | no |
| Limits | max_time_series | Global time series limit. If exceeded, collection is skipped for this cycle. | 2000 | no |
| Limits | max_time_series_per_metric | Per-metric time series limit. Metrics above this limit are skipped. | 200 | no |
| HTTP Auth | username | Username for Basic HTTP authentication. | | no |
| HTTP Auth | password | Password for Basic HTTP authentication. | | no |
| HTTP Auth | bearer_token_file | Path to a file containing a bearer token. | | no |
| TLS | tls_skip_verify | Skip TLS certificate and hostname verification (insecure). | no | no |
| TLS | tls_ca | Path to CA bundle used to validate the server certificate. | | no |
| TLS | tls_cert | Path to client TLS certificate (for mTLS). | | no |
| TLS | tls_key | Path to client TLS private key (for mTLS). | | no |
| Proxy | proxy_url | HTTP proxy URL. | | no |
| Proxy | proxy_username | Username for proxy authentication. | | no |
| Proxy | proxy_password | Password for proxy authentication. | | no |
| Request | headers | Additional HTTP headers to include in the request. | | no |
| Request | method | HTTP method. | GET | no |
| Request | body | HTTP request body. | | no |
| Request | not_follow_redirects | Do not follow HTTP redirects. | no | no |
| Request | force_http2 | Force HTTP/2 (including h2c over TCP). | no | no |
| Virtual Node | vnode | Associate this job with a Virtual Node. | | no |

via UI

Configure the dcgm collector from the Netdata web interface:

  1. Go to Nodes.
  2. Select the node where you want the dcgm data-collection job to run and click the :gear: (Configure this node). That node will run the data collection.
  3. The Collectors → Jobs view opens by default.
  4. In the Search box, type dcgm (or scroll the list) to locate the dcgm collector.
  5. Click the + next to the dcgm collector to add a new job.
  6. Fill in the job fields, then click Test to verify the configuration and Submit to save.
    • Test runs the job with the provided settings and shows whether data can be collected.
    • If it fails, an error message appears with details (for example, connection refused, timeout, or command execution errors), so you can adjust and retest.

via File

The configuration file name for this integration is go.d/dcgm.conf.

The file format is YAML. Generally, the structure is:

update_every: 1
autodetection_retry: 0
jobs:
  - name: some_name1
  - name: some_name2

You can edit the configuration file using the edit-config script from the Netdata config directory.

cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/dcgm.conf
Examples
Local exporter

Collect metrics from a local dcgm-exporter endpoint.

jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
    update_every: 30
TLS endpoint

Collect metrics over HTTPS with custom CA certificate.

jobs:
  - name: secure
    url: https://dcgm-exporter.example.com:9400/metrics
    update_every: 30
    tls_ca: /etc/netdata/certs/dcgm-ca.crt
Increased cardinality limits

Increase limits when collecting large field sets and multiple entities.

jobs:
  - name: dcgm_large
    url: http://127.0.0.1:9400/metrics
    update_every: 30
    max_time_series: 10000
    max_time_series_per_metric: 2000
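Behind an authenticating proxy

A sketch for when the exporter is published through a reverse proxy that enforces Basic HTTP authentication (an assumption; dcgm-exporter is typically served unauthenticated). The hostname and credentials are placeholders:

jobs:
  - name: proxied
    url: https://dcgm-exporter.example.com/metrics
    update_every: 30
    username: netdata
    password: changeme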

Metrics

Metrics grouped by scope.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

Metrics are grouped into static Netdata contexts. Contexts are created only when matching DCGM fields are present in the exporter output.

Per gpu

These metrics refer to GPU device instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| uuid | uuid label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.gpu.capability.support | cc_mode, cuda_compute_capability, gpm_support, mig_attributes, mig_ci_info, mig_gi_info, mig_max_slices, supported_clocks, supported_type_info | state |
| dcgm.gpu.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.gpu.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.gpu.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.gpu.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.gpu.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.gpu.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.gpu.cpu.power | module_power_util_current, sysio_power_util_current | Watts |
| dcgm.gpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.gpu.diagnostics.results | diag_diagnostic_result, diag_eud_result, diag_memory_bandwidth_result, diag_memory_result, diag_memtest_result, diag_nccl_tests_result, diag_nvbandwidth_result, diag_pulse_test_result, diag_software_result, diag_targeted_power_result, diag_targeted_stress_result | state |
| dcgm.gpu.diagnostics.status | diag_status | state |
| dcgm.gpu.health.status | imex_daemon_status, imex_domain_status | state |
| dcgm.gpu.interconnect.connectx.error_status | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | state |
| dcgm.gpu.interconnect.connectx.errors | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | errors/s |
| dcgm.gpu.interconnect.connectx.link | connectx_active_pcie_link_speed, connectx_expect_pcie_link_speed | value |
| dcgm.gpu.interconnect.connectx.status | connectx_health | state |
| dcgm.gpu.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.gpu.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.gpu.interconnect.nvlink.error_rate | gpu_nvlink_errors | errors/s |
| dcgm.gpu.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.gpu.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.gpu.interconnect.pcie.link.width | connectx_active_pcie_link_width, connectx_expect_pcie_link_width, link_width, max_link_width | lanes |
| dcgm.gpu.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.gpu.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.gpu.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.gpu.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.gpu.interconnect.nvlink.throughput | nvlink_rx, nvlink_tx | B/s |
| dcgm.gpu.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.gpu.internal.boundary | first_connectx_field_id, first_vgpu_field_id, internal_fields_0_end, internal_fields_0_start, last_connectx_field_id, last_vgpu_field_id | state |
| dcgm.gpu.inventory.identity | brand, count, cuda_visible_devices_str, minor_number, name, nvml_index, serial, uuid | value |
| dcgm.gpu.inventory.platform | platform_chassis_serial_number, platform_chassis_slot_number, platform_host_id, platform_infiniband_guid, platform_module_id, platform_peer_type, platform_tray_index | value |
| dcgm.gpu.inventory.software | inforom_config_check, inforom_config_valid, inforom_image_ver, oem_inforom_ver, power_inforom_ver, process_name, vbios_version | value |
| dcgm.gpu.memory.bar1_usage | free, used | B |
| dcgm.gpu.memory.bar1_capacity | total | B |
| dcgm.gpu.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors/s |
| dcgm.gpu.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.gpu.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.gpu.memory.usage | free, reserved, used | B |
| dcgm.gpu.memory.capacity | total | B |
| dcgm.gpu.memory.utilization | used_percent | % |
| dcgm.gpu.power.energy | total | mJ/s |
| dcgm.gpu.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.gpu.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.gpu.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.gpu.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.gpu.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.gpu.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.gpu.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.gpu.reliability.xid | xid | code |
| dcgm.gpu.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.gpu.state.performance | pstate | state |
| dcgm.gpu.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.gpu.thermal.fan_speed | fan_speed | % |
| dcgm.gpu.thermal.temperature | connectx_device_temperature, gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.gpu.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.gpu.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |
| dcgm.gpu.topology.affinity | cpu_affinity_0, cpu_affinity_1, cpu_affinity_2, cpu_affinity_3, gpu_topology_affinity, gpu_topology_pci, mem_affinity_0, mem_affinity_1, mem_affinity_2, mem_affinity_3, pci_busid, pci_combined_id, pci_subsys_id | value |
| dcgm.gpu.virtualization.vgpu.frame_rate | vgpu_frame_rate_limit | fps |
| dcgm.gpu.virtualization.vgpu.instance | vgpu_instance_ids, vgpu_pci_id, vgpu_uuid | value |
| dcgm.gpu.virtualization.vgpu.license | vgpu_instance_license_state, vgpu_license_status, vgpu_type_license | state |
| dcgm.gpu.virtualization.vgpu.memory | vgpu_memory_usage | B |
| dcgm.gpu.virtualization.vgpu.sessions | vgpu_enc_sessions_info, vgpu_enc_stats, vgpu_fbc_sessions_info, vgpu_fbc_stats | value |
| dcgm.gpu.virtualization.vgpu.software | vgpu_driver_version | value |
| dcgm.gpu.virtualization.vgpu.type | creatable_vgpu_type_ids, supported_vgpu_type_ids, vgpu_type, vgpu_type_class, vgpu_type_info, vgpu_type_name | value |
| dcgm.gpu.virtualization.vgpu.utilization | vgpu_per_process_utilization | % |
| dcgm.gpu.virtualization.vgpu.vm | vgpu_vm_gpu_instance_id, vgpu_vm_id, vgpu_vm_name | value |
| dcgm.gpu.workload.sessions | accounting_data, enc_stats, fbc_sessions_info, fbc_stats | value |

Per mig

These metrics refer to MIG instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| gpu_i_id | gpu_i_id label from exporter metrics. |
| gpu_i_profile | gpu_i_profile label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.mig.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.mig.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.mig.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.mig.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.mig.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.mig.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.mig.interconnect.nvlink.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.mig.interconnect.nvlink.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.mig.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.mig.interconnect.nvlink.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.mig.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.mig.interconnect.nvlink.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.mig.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.mig.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.mig.interconnect.pcie.link.width | link_width, max_link_width | lanes |
| dcgm.mig.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.mig.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.mig.interconnect.nvlink.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.mig.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.mig.interconnect.nvlink.throughput | nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.mig.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.mig.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.mig.interconnect.nvlink.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.mig.memory.bar1_usage | free, used | B |
| dcgm.mig.memory.bar1_capacity | total | B |
| dcgm.mig.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex, nvlink_ecc_data_error | errors/s |
| dcgm.mig.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.mig.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.mig.memory.usage | free, reserved, used | B |
| dcgm.mig.memory.capacity | total | B |
| dcgm.mig.memory.utilization | used_percent | % |
| dcgm.mig.power.energy | total | mJ/s |
| dcgm.mig.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.mig.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.mig.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.mig.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.mig.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.mig.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.mig.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.mig.reliability.xid | xid | code |
| dcgm.mig.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.mig.state.performance | pstate | state |
| dcgm.mig.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.mig.thermal.fan_speed | fan_speed | % |
| dcgm.mig.thermal.temperature | gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.mig.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.mig.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |

Per nvlink

These metrics refer to NVLink link instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| gpu_uuid | gpu_uuid label from exporter metrics. |
| nvlink | nvlink label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.nvlink.interconnect.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.nvlink.interconnect.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.nvlink.interconnect.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.nvlink.interconnect.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.nvlink.interconnect.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.nvlink.interconnect.throughput | nvlink_bandwidth, nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.nvlink.interconnect.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.nvlink.internal.boundary | nvlink_ppcnt_recovery_time_between_last_two | state |
| dcgm.nvlink.memory.ecc_error_rate | nvlink_ecc_data_error | errors/s |

Per nvswitch

These metrics refer to NVSwitch instances.

Labels:

| Label | Description |
|-------|-------------|
| nvswitch | nvswitch label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.nvswitch.interconnect.nvswitch.current | nvswitch_current_iddq, nvswitch_current_iddq_dvdd, nvswitch_current_iddq_rev | value |
| dcgm.nvswitch.interconnect.nvswitch.errors | nvswitch_fatal_errors, nvswitch_link_crc_errors, nvswitch_link_crc_errors_lane0, nvswitch_link_crc_errors_lane1, nvswitch_link_crc_errors_lane2, nvswitch_link_crc_errors_lane3, nvswitch_link_crc_errors_lane4, nvswitch_link_crc_errors_lane5, nvswitch_link_crc_errors_lane6, nvswitch_link_crc_errors_lane7, nvswitch_link_fatal_errors, nvswitch_link_flit_errors, nvswitch_link_non_fatal_errors, nvswitch_link_recovery_errors, nvswitch_link_replay_errors, nvswitch_non_fatal_errors | errors/s |
| dcgm.nvswitch.interconnect.nvswitch.latency | nvswitch_link_latency_count_vc0, nvswitch_link_latency_count_vc1, nvswitch_link_latency_count_vc2, nvswitch_link_latency_count_vc3, nvswitch_link_latency_high_vc0, nvswitch_link_latency_high_vc1, nvswitch_link_latency_high_vc2, nvswitch_link_latency_high_vc3, nvswitch_link_latency_low_vc0, nvswitch_link_latency_low_vc1, nvswitch_link_latency_low_vc2, nvswitch_link_latency_low_vc3, nvswitch_link_latency_medium_vc0, nvswitch_link_latency_medium_vc1, nvswitch_link_latency_medium_vc2, nvswitch_link_latency_medium_vc3, nvswitch_link_latency_panic_vc0, nvswitch_link_latency_panic_vc1, nvswitch_link_latency_panic_vc2, nvswitch_link_latency_panic_vc3 | events/s |
| dcgm.nvswitch.interconnect.nvswitch.power | nvswitch_power_dvdd, nvswitch_power_hvdd, nvswitch_power_vdd | Watts |
| dcgm.nvswitch.interconnect.nvswitch.status | nvswitch_link_status, nvswitch_link_type, nvswitch_reset_required | state |
| dcgm.nvswitch.interconnect.nvswitch.throughput | nvswitch_link_throughput_rx, nvswitch_link_throughput_tx, nvswitch_throughput_rx, nvswitch_throughput_tx | B/s |
| dcgm.nvswitch.interconnect.nvswitch.topology | nvswitch_device_uuid, nvswitch_link_device_link_id, nvswitch_link_device_link_sid, nvswitch_link_id, nvswitch_link_remote_pcie_bus, nvswitch_link_remote_pcie_device, nvswitch_link_remote_pcie_domain, nvswitch_link_remote_pcie_function, nvswitch_pcie_bus, nvswitch_pcie_device, nvswitch_pcie_domain, nvswitch_pcie_function, nvswitch_phys_id | value |
| dcgm.nvswitch.interconnect.nvswitch.voltage | nvswitch_voltage_mvolt | mV |
| dcgm.nvswitch.internal.boundary | first_nvswitch_field_id, last_nvswitch_field_id | state |
| dcgm.nvswitch.memory.ecc_error_rate | nvswitch_link_ecc_errors, nvswitch_link_ecc_errors_lane0, nvswitch_link_ecc_errors_lane1, nvswitch_link_ecc_errors_lane2, nvswitch_link_ecc_errors_lane3, nvswitch_link_ecc_errors_lane4, nvswitch_link_ecc_errors_lane5, nvswitch_link_ecc_errors_lane6, nvswitch_link_ecc_errors_lane7 | errors/s |
| dcgm.nvswitch.thermal.temperature | nvswitch_temperature_current, nvswitch_temperature_limit_shutdown, nvswitch_temperature_limit_slowdown | Celsius |

Per cpu

These metrics refer to host CPU instances.

Labels:

| Label | Description |
|-------|-------------|
| cpu | cpu label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.cpu.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu.diagnostics.results | diag_cpu_eud_result | state |

Per cpu_core

These metrics refer to host CPU core instances.

Labels:

| Label | Description |
|-------|-------------|
| cpu | cpu label from exporter metrics. |
| cpucore | cpucore label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.cpu_core.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu_core.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu_core.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu_core.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu_core.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu_core.diagnostics.results | diag_cpu_eud_result | state |

Per exporter

These metrics refer to exporter/global instances.

Labels:

| Label | Description |
|-------|-------------|
| job | job label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.exporter.health.status | bind_unbind_event | state |
| dcgm.exporter.inventory.software | cuda_driver_version, driver_version, nvml_version | value |

Alerts

The following alerts are available:

| Alert name | On metric | Description |
|------------|-----------|-------------|
| dcgm_gpu_xid_errors | dcgm.gpu.reliability.xid | NVIDIA driver reported GPU XID error on GPU ${label:gpu} |
| dcgm_gpu_row_remap_failure | dcgm.gpu.reliability.row_remap_status | GPU row remapping failed on GPU ${label:gpu} |
| dcgm_gpu_uncorrectable_remapped_rows | dcgm.gpu.reliability.row_remap_events | Uncorrectable remapped rows increased on GPU ${label:gpu} |
| dcgm_gpu_power_violation | dcgm.gpu.throttle.violations | Power throttling detected on GPU ${label:gpu} |
| dcgm_gpu_thermal_violation | dcgm.gpu.throttle.violations | Thermal throttling detected on GPU ${label:gpu} |

Troubleshooting

Debug Mode

Important: Debug mode is not supported for data collection jobs created via the UI using the Dyncfg feature.

To troubleshoot issues with the dcgm collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn’t working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that’s not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
    
  • Switch to the netdata user.

    sudo -u netdata -s
    
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m dcgm
    

    To debug a specific job:

    ./go.d.plugin -d -m dcgm -j jobName
    

Getting Logs

If you’re encountering problems with the dcgm collector, follow these steps to retrieve logs and identify potential issues:

  • Run the command specific to your system (systemd, non-systemd, or Docker container).
  • Examine the output for any warnings or error messages that might indicate issues. These messages should provide clues about the root cause of the problem.

System with systemd

Use the following command to view logs generated since the last Netdata service restart:

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep dcgm

System without systemd

Locate the collector log file, typically at /var/log/netdata/collector.log, and use grep to filter for the collector’s name:

grep dcgm /var/log/netdata/collector.log

Note: This method shows logs from all restarts. Focus on the latest entries for troubleshooting current issues.

Docker Container

If your Netdata runs in a Docker container named “netdata” (replace if different), use this command:

docker logs netdata 2>&1 | grep dcgm

The observability platform companies need to succeed

Sign up for free

Want a personalised demo of Netdata for your use case?

Book a Demo