The only agent that thinks for itself

Autonomous Monitoring with self-learning AI built-in, operating independently across your entire stack.

Unlimited Metrics & Logs
Machine learning & MCP
5% CPU, 150MB RAM
3GB disk, >1 year retention
800+ integrations, zero config
Dashboards, alerts out of the box
> Discover Netdata Agents

Centralized metrics streaming and storage

Aggregate metrics from multiple agents into centralized Parent nodes for unified monitoring across your infrastructure.

Stream from unlimited agents
Long-term data retention
High availability clustering
Data replication & backup
Scalable architecture
Enterprise-grade security
> Learn about Parents

Fully managed cloud platform

Access your monitoring data from anywhere with our SaaS platform. No infrastructure to manage, automatic updates, and global availability.

Zero infrastructure management
99.9% uptime SLA
Global data centers
Automatic updates & patches
Enterprise SSO & RBAC
SOC2 & ISO certified
> Explore Netdata Cloud

Deploy Netdata Cloud in your infrastructure

Run the full Netdata Cloud platform on-premises for complete data sovereignty and compliance with your security policies.

Complete data sovereignty
Air-gapped deployment
Custom compliance controls
Private network integration
Dedicated support team
Kubernetes & Docker support
> Learn about Cloud On-Premises

Powerful, intuitive monitoring interface

Modern, responsive UI built for real-time troubleshooting with customizable dashboards and advanced visualization capabilities.

Real-time chart updates
Customizable dashboards
Dark & light themes
Advanced filtering & search
Responsive on all devices
Collaboration features
> Explore Netdata UI

Monitor on the go

Native iOS and Android apps bring full monitoring capabilities to your mobile device with real-time alerts and notifications.

iOS & Android apps
Push notifications
Touch-optimized interface
Offline data access
Biometric authentication
Widget support
> Download apps

Best energy efficiency

True real-time per-second

100% automated zero config

Centralized observability

Multi-year retention

High availability built-in

Zero maintenance

Always up-to-date

Enterprise security

Complete data control

Air-gap ready

Compliance certified

Millisecond responsiveness

Infinite zoom & pan

Works on any device

Native performance

Instant alerts

Monitor anywhere

80% Faster Incident Resolution

AI-powered troubleshooting from detection, to root cause and blast radius identification, to reporting.

True Real-Time and Simple, even at Scale

Linearly and infinitely scalable full-stack observability that can be deployed even mid-crisis.

90% Cost Reduction, Full Fidelity

Instead of centralizing the data, Netdata distributes the code, eliminating pipelines and complexity.

Control Without Surrender

SOC 2 Type 2 certified with every metric kept on your infrastructure.

Integrations

800+ collectors and notification channels, auto-discovered and ready out of the box.

800+ data collectors
Auto-discovery & zero config
Cloud, infra, app protocols
Notifications out of the box
> Explore integrations
Real Results
46% Cost Reduction

Reduced monitoring costs by 46% while cutting staff overhead by 67%.

— Leonardo Antunez, Codyas

Zero Pipeline

No data shipping. No central storage costs. Query at the edge.

From Our Users
"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

No Query Language

Point-and-click troubleshooting. No PromQL, no LogQL, no learning curve.

Enterprise Ready
67% Less Staff, 46% Cost Cut

Enterprise efficiency without enterprise complexity—real ROI from day one.

— Leonardo Antunez, Codyas

SOC 2 Type 2 Certified

Zero data egress. Only metadata reaches the cloud. Your metrics stay on your infrastructure.

Full Coverage
800+ Collectors

Auto-discovered and configured. No manual setup required.

Any Notification Channel

Slack, PagerDuty, Teams, email, webhooks—all built-in.

Built for the People Who Get Paged

Because 3am alerts deserve instant answers, not hour-long hunts.

Every Industry Has Rules. We Master Them.

See how healthcare, finance, and government teams cut monitoring costs 90% while staying audit-ready.

Monitor Any Technology. Configure Nothing.

Install the agent. It already knows your stack.
From Our Users
"A Rare Unicorn"

Netdata gives more than you invest in it. A rare unicorn that obeys the Pareto rule.

— Eduard Porquet Mateu, TMB Barcelona

99% Downtime Reduction

Reduced website downtime by 99% and cloud bill by 30% using Netdata alerts.

— Falkland Islands Government

Real Savings
30% Cloud Cost Reduction

Optimized resource allocation based on Netdata alerts cut cloud spending by 30%.

— Falkland Islands Government

46% Cost Cut

Reduced monitoring staff by 67% while cutting operational costs by 46%.

— Codyas

Real Coverage
"Plugin for Everything"

Netdata has agent capacity or a plugin for everything, including Windows and Kubernetes.

— Eduard Porquet Mateu, TMB Barcelona

"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

Real Speed
Troubleshooting in 30 Seconds

From 2-3 minutes to 30 seconds—instant visibility into any node issue.

— Matthew Artist, Nodecraft

20% Downtime Reduction

20% less downtime and 40% budget optimization from out-of-the-box monitoring.

— Simon Beginn, LANCOM Systems

Pay per Node. Unlimited Everything Else.

One price per node. Unlimited metrics, logs, users, and retention. No per-GB surprises.

Free tier—forever
No metric limits or caps
Retention you control
Cancel anytime
> See pricing plans

What's Your Monitoring Really Costing You?

Most teams overpay by 40-60%. Let's find out why.

Expose hidden metric charges
Calculate tool consolidation
Customers report 30-67% savings
Results in under 60 seconds
> See what you're really paying

Your Infrastructure Is Unique. Let's Talk.

Because monitoring 10 nodes is different from monitoring 10,000.

On-prem & air-gapped deployment
Volume pricing & agreements
Architecture review for your scale
Compliance & security support
> Start a conversation

Monitoring That Sells Itself

Deploy in minutes. Impress clients in hours. Earn recurring revenue for years.

30-second live demos close deals
Zero config = zero support burden
Competitive margins & deal protection
Response in 48 hours
> Apply to partner

Per-Second Metrics at Homelab Prices

Same engine, same dashboards, same ML. Just priced for tinkerers.

Community: Free forever · 5 nodes · non-commercial
Homelab: $90/yr · unlimited nodes · fair usage
> Start monitoring your lab—free

$1,000 Per Referral. Unlimited Referrals.

Your colleagues get 10% off. You get 10% commission. Everyone wins.

10% of subscriptions, up to $1,000 each
Track earnings inside Netdata Cloud
PayPal/Venmo payouts in 3-4 weeks
No caps, no complexity
> Get your referral link
Cost Proof
40% Budget Optimization

"Netdata's significant positive impact" — LANCOM Systems

Calculate Your Savings

Compare vs Datadog, Grafana, Dynatrace

Savings Proof
46% Cost Reduction

"Cut costs by 46%, staff by 67%" — Codyas

30% Cloud Bill Savings

"Reduced cloud bill by 30%" — Falkland Islands Gov

Enterprise Proof
"Better Than Combined Alternatives"

"Better observability with Netdata than combining other tools." — TMB Barcelona

Real Engineers, <24h Response

DPA, SLAs, on-prem, volume pricing

Why Partners Win
Demo Live Infrastructure

One command, 30 seconds, real data—no sandbox needed

Zero Tickets, High Margins

Auto-config + per-node pricing = predictable profit

Homelab Ready
"Absolutely Incredible"

"We tested every monitoring system under the sun." — Benjamin Gabler, CEO Rocket.Net

76k+ GitHub Stars

3rd most starred monitoring project

Worth Recommending
Product That Delivers

Customers report 40-67% cost cuts, 99% downtime reduction

Zero Risk to Your Rep

Free tier lets them try before they buy

AI Support Assistant, Available 24/7

Nedi has access to all official documentation, source code, and resources. Ask any question about Netdata—responds in your language.

Deployment & configuration
Troubleshooting & sizing
Alerts & notifications
Evidence-based answers
> Ask Nedi now

Never Fight Fires Alone

Docs, community, and expert help—pick your path to resolution.

Learn.netdata.cloud docs
Discord, Forums, GitHub
Premium support available
> Get answers now

60 Seconds to First Dashboard

One command to install. Zero config. 850+ integrations documented.

Linux, Windows, K8s, Docker
Auto-discovers your stack
> Read our documentation

Level Up Your Monitoring

Real problems. Real solutions. 112+ guides from basic monitoring to AI observability.

76,000+ Engineers Strong

615+ contributors. 1.5M daily downloads. One mission: simplify observability.

Per-Second. 90% Cheaper. Data Stays Home.

Side-by-side comparisons: costs, real-time granularity, and data sovereignty for every major tool.

See why teams switch from Datadog, Prometheus, Grafana, and more.

> Browse all comparisons
Edge-Native Observability, Born Open Source
Per-second visibility, ML on every metric, and data that never leaves your infrastructure.
Founded in 2016
615+ contributors worldwide
Remote-first, engineering-driven
Open source first
> Read our story
Promises We Publish—and Prove
12 principles backed by open code, independent validation, and measurable outcomes.
Open source, peer-reviewed
Zero config, instant value
Data sovereignty by design
Aligned pricing, no surprises
> See all 12 principles
Edge-Native, AI-Ready, 100% Open
76k+ stars. Full ML, AI, and automation—GPLv3+, not premium add-ons.
76,000+ GitHub stars
GPLv3+ licensed forever
ML on every metric, included
Zero vendor lock-in
> Explore our open source
Build Real-Time Observability for the World
Remote-first team shipping per-second monitoring with ML on every metric.
Remote-first, fully distributed
Open source (76k+ stars)
Challenging technical problems
Your code on millions of systems
> See open roles
Talk to a Netdata Human in <24 Hours
Sales, partnerships, press, or professional services—real engineers, fast answers.
Discuss your observability needs
Pricing and volume discounts
Partnership opportunities
Media and press inquiries
> Book a conversation
Your Data. Your Rules.
On-prem data, cloud control plane, transparent terms.
Trust & Scale
76,000+ GitHub Stars

One of the most popular open-source monitoring projects

SOC 2 Type 2 Certified

Enterprise-grade security and compliance

Data Sovereignty

Your metrics stay on your infrastructure

Validated
University of Amsterdam

"Most energy-efficient monitoring solution" — ICSOC 2023, peer-reviewed

ADASTEC (Autonomous Driving)

"Doesn't miss alerts—mission-critical trust for safety software"

Community Stats
615+ Contributors

Global community improving monitoring for everyone

1.5M+ Downloads/Day

Trusted by teams worldwide

GPLv3+ Licensed

Free forever, fully open source agent

Why Join?
Remote-First

Work from anywhere, async-friendly culture

Impact at Scale

Your work helps millions of systems

Nvidia Data Center GPU Manager (DCGM)

Plugin: go.d.plugin Module: dcgm

Overview

This collector gathers NVIDIA GPU telemetry from a dcgm-exporter endpoint. It supports all numeric fields exposed by the exporter and maps them into Netdata-native contexts.

It collects metrics by periodically scraping the exporter Prometheus endpoint over HTTP.
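You can see exactly what the collector will scrape by querying the endpoint yourself. A quick sanity check, assuming the exporter listens on its default port (the sample output lines are illustrative):

curl -s http://127.0.0.1:9400/metrics | head
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxxxxxx"} 41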

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.
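In practice that means one job per exporter endpoint. A minimal sketch of a multi-job go.d/dcgm.conf (the remote hostname is a placeholder):

jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
  - name: gpu_node_2
    url: http://gpu-node-2.example.com:9400/metrics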

Nvidia Data Center GPU Manager (DCGM) can be monitored further using the following other integrations:

  • {% relatedResource id="go.d.plugin-nvidia_smi-Nvidia_GPU" %}Nvidia GPU{% /relatedResource %}

Default Behavior

Auto-Detection

This integration does not support auto-detection in v1.

Limits

The collector applies global and per-metric time series limits to prevent excessive cardinality.

Performance Impact

The impact depends on dcgm-exporter field selection and resulting series cardinality.

Setup

You can configure the dcgm collector in two ways:

| Method | Best for | How to |
|--------|----------|--------|
| UI | Fast setup without editing files | Go to Nodes → Configure this node → Collectors → Jobs, search for dcgm, then click + to add a job. |
| File | If you prefer configuring via file, or need to automate deployments (e.g., with Ansible) | Edit go.d/dcgm.conf and add a job. |

:::important

UI configuration requires a paid Netdata Cloud plan.

:::

Prerequisites

Run dcgm-exporter

Install DCGM and run dcgm-exporter so that a Prometheus endpoint is available (default :9400/metrics).
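As a sketch, one common way to do this is with NVIDIA's container image (the image tag is a placeholder; pick a current one from NVIDIA's registry):

# requires the NVIDIA container toolkit for --gpus
docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>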

Configure exporter field list

The default exporter profile exposes a small subset of fields. Use the Netdata recommended profile: dcgm-exporter-netdata.csv (raw download: https://raw.githubusercontent.com/netdata/netdata/master/src/go/plugin/go.d/collector/dcgm/dcgm-exporter-netdata.csv).

The Netdata profile enables 127 fields by default and documents all remaining known DCGM fields as commented entries. To customize beyond the baseline, uncomment the field you need and comment out one currently enabled field.
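For reference, each CSV line follows the standard dcgm-exporter format: DCGM field name, Prometheus metric type, help string. The entries below are illustrative and are not copied from the Netdata profile:

DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
# DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).   <- commented out, not collected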

Runtime validation artifacts: src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.md and src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.json

Validation is primarily version-scoped (NVIDIA driver + DCGM/DCGM-exporter versions), so treat it as a strong baseline rather than universal compatibility.

Example: dcgm-exporter -f /path/to/dcgm-exporter-netdata.csv

Keep collection intervals aligned

Set Netdata update_every to the same value as the dcgm-exporter collection interval (default 30 seconds). Example exporter interval: dcgm-exporter -c 30000 and Netdata update_every: 30.
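Putting the two sides together, a minimal sketch (exporter flag on one side, go.d/dcgm.conf job on the other):

# exporter: collect DCGM fields every 30000 ms
dcgm-exporter -c 30000

# Netdata job (go.d/dcgm.conf): scrape the exporter every 30 s
jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
    update_every: 30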

Enable profiling capabilities (optional)

Profiling fields may require additional privileges/capabilities in your runtime environment.

Configuration

Options

The following options can be defined globally: update_every, autodetection_retry.

| Group | Option | Description | Default | Required |
|-------|--------|-------------|---------|----------|
| Collection | update_every | Data collection interval (seconds). Keep this aligned with dcgm-exporter collection interval. | 30 | no |
| Collection | autodetection_retry | Autodetection retry interval (seconds). Set 0 to disable. | 0 | no |
| Target | url | DCGM exporter metrics endpoint URL. | http://127.0.0.1:9400/metrics | yes |
| Target | timeout | HTTP request timeout (seconds). | 10 | no |
| Limits | max_time_series | Global time series limit. If exceeded, collection is skipped for this cycle. | 2000 | no |
| Limits | max_time_series_per_metric | Per-metric time series limit. Metrics above this limit are skipped. | 200 | no |
| HTTP Auth | username | Username for Basic HTTP authentication. | | no |
| HTTP Auth | password | Password for Basic HTTP authentication. | | no |
| HTTP Auth | bearer_token_file | Path to a file containing a bearer token. | | no |
| TLS | tls_skip_verify | Skip TLS certificate and hostname verification (insecure). | no | no |
| TLS | tls_ca | Path to CA bundle used to validate the server certificate. | | no |
| TLS | tls_cert | Path to client TLS certificate (for mTLS). | | no |
| TLS | tls_key | Path to client TLS private key (for mTLS). | | no |
| Proxy | proxy_url | HTTP proxy URL. | | no |
| Proxy | proxy_username | Username for proxy authentication. | | no |
| Proxy | proxy_password | Password for proxy authentication. | | no |
| Request | headers | Additional HTTP headers to include in the request. | | no |
| Request | method | HTTP method. | GET | no |
| Request | body | HTTP request body. | | no |
| Request | not_follow_redirects | Do not follow HTTP redirects. | no | no |
| Request | force_http2 | Force HTTP/2 (including h2c over TCP). | no | no |
| Virtual Node | vnode | Associate this job with a Virtual Node. | | no |

via UI

Configure the dcgm collector from the Netdata web interface:

  1. Go to Nodes.
  2. Select the node where you want the dcgm data-collection job to run and click the :gear: (Configure this node). That node will run the data collection.
  3. The Collectors → Jobs view opens by default.
  4. In the Search box, type dcgm (or scroll the list) to locate the dcgm collector.
  5. Click the + next to the dcgm collector to add a new job.
  6. Fill in the job fields, then click Test to verify the configuration and Submit to save.
    • Test runs the job with the provided settings and shows whether data can be collected.
    • If it fails, an error message appears with details (for example, connection refused, timeout, or command execution errors), so you can adjust and retest.

via File

The configuration file name for this integration is go.d/dcgm.conf.

The file format is YAML. Generally, the structure is:

update_every: 1
autodetection_retry: 0
jobs:
  - name: some_name1
  - name: some_name2

You can edit the configuration file using the edit-config script from the Netdata config directory.

cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/dcgm.conf
Examples
Local exporter

Collect metrics from a local dcgm-exporter endpoint.

jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
    update_every: 30
TLS endpoint

Collect metrics over HTTPS with custom CA certificate.

jobs:
  - name: secure
    url: https://dcgm-exporter.example.com:9400/metrics
    update_every: 30
    tls_ca: /etc/netdata/certs/dcgm-ca.crt
Increased cardinality limits

Increase limits when collecting large field sets and multiple entities.

jobs:
  - name: dcgm_large
    url: http://127.0.0.1:9400/metrics
    update_every: 30
    max_time_series: 10000
    max_time_series_per_metric: 2000
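Behind an authenticating proxy

A sketch for when the exporter is published through a reverse proxy that enforces Basic HTTP authentication (an assumption; dcgm-exporter is typically served unauthenticated). The hostname and credentials are placeholders:

jobs:
  - name: proxied
    url: https://dcgm-exporter.example.com/metrics
    update_every: 30
    username: netdata
    password: changeme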

Metrics

Metrics grouped by scope.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

Metrics are grouped into static Netdata contexts. Contexts are created only when matching DCGM fields are present in the exporter output.

Per gpu

These metrics refer to GPU device instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| uuid | uuid label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.gpu.capability.support | cc_mode, cuda_compute_capability, gpm_support, mig_attributes, mig_ci_info, mig_gi_info, mig_max_slices, supported_clocks, supported_type_info | state |
| dcgm.gpu.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.gpu.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.gpu.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.gpu.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.gpu.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.gpu.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.gpu.cpu.power | module_power_util_current, sysio_power_util_current | Watts |
| dcgm.gpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.gpu.diagnostics.results | diag_diagnostic_result, diag_eud_result, diag_memory_bandwidth_result, diag_memory_result, diag_memtest_result, diag_nccl_tests_result, diag_nvbandwidth_result, diag_pulse_test_result, diag_software_result, diag_targeted_power_result, diag_targeted_stress_result | state |
| dcgm.gpu.diagnostics.status | diag_status | state |
| dcgm.gpu.health.status | imex_daemon_status, imex_domain_status | state |
| dcgm.gpu.interconnect.connectx.error_status | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | state |
| dcgm.gpu.interconnect.connectx.errors | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | errors/s |
| dcgm.gpu.interconnect.connectx.link | connectx_active_pcie_link_speed, connectx_expect_pcie_link_speed | value |
| dcgm.gpu.interconnect.connectx.status | connectx_health | state |
| dcgm.gpu.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.gpu.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.gpu.interconnect.nvlink.error_rate | gpu_nvlink_errors | errors/s |
| dcgm.gpu.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.gpu.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.gpu.interconnect.pcie.link.width | connectx_active_pcie_link_width, connectx_expect_pcie_link_width, link_width, max_link_width | lanes |
| dcgm.gpu.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.gpu.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.gpu.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.gpu.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.gpu.interconnect.nvlink.throughput | nvlink_rx, nvlink_tx | B/s |
| dcgm.gpu.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.gpu.internal.boundary | first_connectx_field_id, first_vgpu_field_id, internal_fields_0_end, internal_fields_0_start, last_connectx_field_id, last_vgpu_field_id | state |
| dcgm.gpu.inventory.identity | brand, count, cuda_visible_devices_str, minor_number, name, nvml_index, serial, uuid | value |
| dcgm.gpu.inventory.platform | platform_chassis_serial_number, platform_chassis_slot_number, platform_host_id, platform_infiniband_guid, platform_module_id, platform_peer_type, platform_tray_index | value |
| dcgm.gpu.inventory.software | inforom_config_check, inforom_config_valid, inforom_image_ver, oem_inforom_ver, power_inforom_ver, process_name, vbios_version | value |
| dcgm.gpu.memory.bar1_usage | free, used | B |
| dcgm.gpu.memory.bar1_capacity | total | B |
| dcgm.gpu.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors/s |
| dcgm.gpu.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.gpu.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.gpu.memory.usage | free, reserved, used | B |
| dcgm.gpu.memory.capacity | total | B |
| dcgm.gpu.memory.utilization | used_percent | % |
| dcgm.gpu.power.energy | total | mJ/s |
| dcgm.gpu.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.gpu.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.gpu.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.gpu.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.gpu.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.gpu.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.gpu.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.gpu.reliability.xid | xid | code |
| dcgm.gpu.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.gpu.state.performance | pstate | state |
| dcgm.gpu.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.gpu.thermal.fan_speed | fan_speed | % |
| dcgm.gpu.thermal.temperature | connectx_device_temperature, gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.gpu.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.gpu.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |
| dcgm.gpu.topology.affinity | cpu_affinity_0, cpu_affinity_1, cpu_affinity_2, cpu_affinity_3, gpu_topology_affinity, gpu_topology_pci, mem_affinity_0, mem_affinity_1, mem_affinity_2, mem_affinity_3, pci_busid, pci_combined_id, pci_subsys_id | value |
| dcgm.gpu.virtualization.vgpu.frame_rate | vgpu_frame_rate_limit | fps |
| dcgm.gpu.virtualization.vgpu.instance | vgpu_instance_ids, vgpu_pci_id, vgpu_uuid | value |
| dcgm.gpu.virtualization.vgpu.license | vgpu_instance_license_state, vgpu_license_status, vgpu_type_license | state |
| dcgm.gpu.virtualization.vgpu.memory | vgpu_memory_usage | B |
| dcgm.gpu.virtualization.vgpu.sessions | vgpu_enc_sessions_info, vgpu_enc_stats, vgpu_fbc_sessions_info, vgpu_fbc_stats | value |
| dcgm.gpu.virtualization.vgpu.software | vgpu_driver_version | value |
| dcgm.gpu.virtualization.vgpu.type | creatable_vgpu_type_ids, supported_vgpu_type_ids, vgpu_type, vgpu_type_class, vgpu_type_info, vgpu_type_name | value |
| dcgm.gpu.virtualization.vgpu.utilization | vgpu_per_process_utilization | % |
| dcgm.gpu.virtualization.vgpu.vm | vgpu_vm_gpu_instance_id, vgpu_vm_id, vgpu_vm_name | value |
| dcgm.gpu.workload.sessions | accounting_data, enc_stats, fbc_sessions_info, fbc_stats | value |

Per mig

These metrics refer to MIG instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| gpu_i_id | gpu_i_id label from exporter metrics. |
| gpu_i_profile | gpu_i_profile label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.mig.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.mig.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.mig.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.mig.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.mig.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.mig.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.mig.interconnect.nvlink.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.mig.interconnect.nvlink.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.mig.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.mig.interconnect.nvlink.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.mig.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.mig.interconnect.nvlink.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.mig.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.mig.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.mig.interconnect.pcie.link.width | link_width, max_link_width | lanes |
| dcgm.mig.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.mig.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.mig.interconnect.nvlink.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.mig.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.mig.interconnect.nvlink.throughput | nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.mig.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.mig.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.mig.interconnect.nvlink.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.mig.memory.bar1_usage | free, used | B |
| dcgm.mig.memory.bar1_capacity | total | B |
| dcgm.mig.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex, nvlink_ecc_data_error | errors/s |
| dcgm.mig.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.mig.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.mig.memory.usage | free, reserved, used | B |
| dcgm.mig.memory.capacity | total | B |
| dcgm.mig.memory.utilization | used_percent | % |
| dcgm.mig.power.energy | total | mJ/s |
| dcgm.mig.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.mig.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.mig.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.mig.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.mig.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.mig.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.mig.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.mig.reliability.xid | xid | code |
| dcgm.mig.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.mig.state.performance | pstate | state |
| dcgm.mig.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.mig.thermal.fan_speed | fan_speed | % |
| dcgm.mig.thermal.temperature | gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.mig.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.mig.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |

Per nvlink

These metrics refer to NVLink link instances.

Labels:

| Label | Description |
|-------|-------------|
| gpu | gpu label from exporter metrics. |
| gpu_uuid | gpu_uuid label from exporter metrics. |
| nvlink | nvlink label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.nvlink.interconnect.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.nvlink.interconnect.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.nvlink.interconnect.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.nvlink.interconnect.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.nvlink.interconnect.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.nvlink.interconnect.throughput | nvlink_bandwidth, nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.nvlink.interconnect.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.nvlink.internal.boundary | nvlink_ppcnt_recovery_time_between_last_two | state |
| dcgm.nvlink.memory.ecc_error_rate | nvlink_ecc_data_error | errors/s |

Per nvswitch

These metrics refer to NVSwitch instances.

Labels:

| Label | Description |
|-------|-------------|
| nvswitch | nvswitch label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.nvswitch.interconnect.nvswitch.current | nvswitch_current_iddq, nvswitch_current_iddq_dvdd, nvswitch_current_iddq_rev | value |
| dcgm.nvswitch.interconnect.nvswitch.errors | nvswitch_fatal_errors, nvswitch_link_crc_errors, nvswitch_link_crc_errors_lane0, nvswitch_link_crc_errors_lane1, nvswitch_link_crc_errors_lane2, nvswitch_link_crc_errors_lane3, nvswitch_link_crc_errors_lane4, nvswitch_link_crc_errors_lane5, nvswitch_link_crc_errors_lane6, nvswitch_link_crc_errors_lane7, nvswitch_link_fatal_errors, nvswitch_link_flit_errors, nvswitch_link_non_fatal_errors, nvswitch_link_recovery_errors, nvswitch_link_replay_errors, nvswitch_non_fatal_errors | errors/s |
| dcgm.nvswitch.interconnect.nvswitch.latency | nvswitch_link_latency_count_vc0, nvswitch_link_latency_count_vc1, nvswitch_link_latency_count_vc2, nvswitch_link_latency_count_vc3, nvswitch_link_latency_high_vc0, nvswitch_link_latency_high_vc1, nvswitch_link_latency_high_vc2, nvswitch_link_latency_high_vc3, nvswitch_link_latency_low_vc0, nvswitch_link_latency_low_vc1, nvswitch_link_latency_low_vc2, nvswitch_link_latency_low_vc3, nvswitch_link_latency_medium_vc0, nvswitch_link_latency_medium_vc1, nvswitch_link_latency_medium_vc2, nvswitch_link_latency_medium_vc3, nvswitch_link_latency_panic_vc0, nvswitch_link_latency_panic_vc1, nvswitch_link_latency_panic_vc2, nvswitch_link_latency_panic_vc3 | events/s |
| dcgm.nvswitch.interconnect.nvswitch.power | nvswitch_power_dvdd, nvswitch_power_hvdd, nvswitch_power_vdd | Watts |
| dcgm.nvswitch.interconnect.nvswitch.status | nvswitch_link_status, nvswitch_link_type, nvswitch_reset_required | state |
| dcgm.nvswitch.interconnect.nvswitch.throughput | nvswitch_link_throughput_rx, nvswitch_link_throughput_tx, nvswitch_throughput_rx, nvswitch_throughput_tx | B/s |
| dcgm.nvswitch.interconnect.nvswitch.topology | nvswitch_device_uuid, nvswitch_link_device_link_id, nvswitch_link_device_link_sid, nvswitch_link_id, nvswitch_link_remote_pcie_bus, nvswitch_link_remote_pcie_device, nvswitch_link_remote_pcie_domain, nvswitch_link_remote_pcie_function, nvswitch_pcie_bus, nvswitch_pcie_device, nvswitch_pcie_domain, nvswitch_pcie_function, nvswitch_phys_id | value |
| dcgm.nvswitch.interconnect.nvswitch.voltage | nvswitch_voltage_mvolt | mV |
| dcgm.nvswitch.internal.boundary | first_nvswitch_field_id, last_nvswitch_field_id | state |
| dcgm.nvswitch.memory.ecc_error_rate | nvswitch_link_ecc_errors, nvswitch_link_ecc_errors_lane0, nvswitch_link_ecc_errors_lane1, nvswitch_link_ecc_errors_lane2, nvswitch_link_ecc_errors_lane3, nvswitch_link_ecc_errors_lane4, nvswitch_link_ecc_errors_lane5, nvswitch_link_ecc_errors_lane6, nvswitch_link_ecc_errors_lane7 | errors/s |
| dcgm.nvswitch.thermal.temperature | nvswitch_temperature_current, nvswitch_temperature_limit_shutdown, nvswitch_temperature_limit_slowdown | Celsius |

Per cpu

These metrics refer to host CPU instances.

Labels:

| Label | Description |
|-------|-------------|
| cpu | cpu label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.cpu.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu.diagnostics.results | diag_cpu_eud_result | state |

Per cpu_core

These metrics refer to host CPU core instances.

Labels:

| Label | Description |
|-------|-------------|
| cpu | cpu label from exporter metrics. |
| cpucore | cpucore label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.cpu_core.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu_core.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu_core.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu_core.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu_core.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu_core.diagnostics.results | diag_cpu_eud_result | state |

Per exporter

These metrics refer to exporter/global instances.

Labels:

| Label | Description |
|-------|-------------|
| job | job label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.exporter.health.status | bind_unbind_event | state |
| dcgm.exporter.inventory.software | cuda_driver_version, driver_version, nvml_version | value |

Alerts

The following alerts are available:

| Alert name | On metric | Description |
|------------|-----------|-------------|
| dcgm_gpu_xid_errors | dcgm.gpu.reliability.xid | NVIDIA driver reported GPU XID error on GPU ${label:gpu} |
| dcgm_gpu_row_remap_failure | dcgm.gpu.reliability.row_remap_status | GPU row remapping failed on GPU ${label:gpu} |
| dcgm_gpu_uncorrectable_remapped_rows | dcgm.gpu.reliability.row_remap_events | Uncorrectable remapped rows increased on GPU ${label:gpu} |
| dcgm_gpu_power_violation | dcgm.gpu.throttle.violations | Power throttling detected on GPU ${label:gpu} |
| dcgm_gpu_thermal_violation | dcgm.gpu.throttle.violations | Thermal throttling detected on GPU ${label:gpu} |

Troubleshooting

Debug Mode

Important: Debug mode is not supported for data collection jobs created via the UI using the Dyncfg feature.

To troubleshoot issues with the dcgm collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn’t working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that’s not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
    
  • Switch to the netdata user.

    sudo -u netdata -s
    
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m dcgm
    

    To debug a specific job:

    ./go.d.plugin -d -m dcgm -j jobName
    

Getting Logs

If you’re encountering problems with the dcgm collector, follow these steps to retrieve logs and identify potential issues:

  • Run the command specific to your system (systemd, non-systemd, or Docker container).
  • Examine the output for any warnings or error messages that might indicate issues. These messages should provide clues about the root cause of the problem.

System with systemd

Use the following command to view logs generated since the last Netdata service restart:

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep dcgm

System without systemd

Locate the collector log file, typically at /var/log/netdata/collector.log, and use grep to filter for the collector’s name:

grep dcgm /var/log/netdata/collector.log

Note: This method shows logs from all restarts. Focus on the latest entries for troubleshooting current issues.

Docker Container

If your Netdata runs in a Docker container named “netdata” (replace if different), use this command:

docker logs netdata 2>&1 | grep dcgm

The observability platform companies need to succeed

Sign up for free

Want a personalised demo of Netdata for your use case?

Book a Demo