Temperature, fan, and PSU monitoring: predicting hardware failure

Environmental sensors on network devices are the earliest leading indicators of hardware failure. Temperature trends, fan state changes, and PSU status transitions often precede field-replaceable unit failures by hours or days. The data is not hard to collect, but the MIB landscape is fragmented across vendors, thresholds vary by platform, and inherited polling templates frequently target deprecated OIDs. A template that worked on a Catalyst 3560 can silently return nothing on a Catalyst 8500.

This article covers the MIB stack for environmental monitoring, the specific signals to poll per sensor type, vendor-specific behavior and deprecations, and the correlation patterns that distinguish a localized sensor fault from a device-wide thermal event.

The MIB stack for environmental monitoring

Three MIB layers are in active use across network equipment. Understanding how they join is the prerequisite to building a polling template that works across vendors.

ENTITY-MIB (RFC 4133, since superseded by RFC 6933) provides the physical entity inventory. Each physical component (line card, fan tray, PSU, sensor) has an entry in entPhysicalTable at .1.3.6.1.2.1.47.1.1.1. The entPhysicalDescr field gives a human-readable name. The entity index (entPhysicalIndex) is the join key used by the other MIBs.

ENTITY-SENSOR-MIB (RFC 3433) is the standards-track MIB for sensor readings. entPhySensorValue at .1.3.6.1.2.1.99.1.1.1.4 returns the numeric reading. entPhySensorOperStatus at .1.3.6.1.2.1.99.1.1.1.5 returns the operational state. The sensor type (temperature, voltage, fan RPM) is encoded in entPhySensorType at .1.3.6.1.2.1.99.1.1.1.1.

Vendor extensions fill gaps. Cisco uses CISCO-ENTITY-FRU-CONTROL-MIB for power supply and fan operational state on modern platforms. Juniper exposes temperature via jnxOperatingTemp at .1.3.6.1.4.1.2636.3.1.13.1.7 in JUNIPER-MIB.

flowchart TD
    A["ENTITY-MIB entPhysicalTable<br/>.1.3.6.1.2.1.47.1.1.1"] -->|join by entity index| B["ENTITY-SENSOR-MIB<br/>.1.3.6.1.2.1.99.1.1.1"]
    B --> C["entPhySensorValue<br/>numeric reading"]
    B --> D["entPhySensorType<br/>sensor type filter"]
    B --> E["entPhySensorOperStatus<br/>health state"]
    A -->|join by entity index| F["Vendor FRU MIBs"]
    F --> G["PSU operational state"]
    F --> H["Fan operational state"]

The common pattern: walk ENTITY-MIB to build the entity inventory, then join sensor readings and operational states from ENTITY-SENSOR-MIB using the shared entity index. Vendor FRU MIBs attach to the same entity index for PSU and fan state where ENTITY-SENSOR-MIB does not cover them.
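The join itself can be sketched in a few lines. This is a minimal illustration, assuming the SNMP walks have already been parsed into dictionaries keyed by entity index; the function name `join_sensors` and the sample indexes are hypothetical.

```python
def join_sensors(descriptions, sensor_values, sensor_status):
    """Join ENTITY-MIB descriptions to ENTITY-SENSOR-MIB rows by entity index.

    Every argument is a dict keyed by entity index (assumed pre-parsed
    from SNMP walks). Only entities that report a sensor value are kept.
    """
    joined = {}
    for index, value in sensor_values.items():
        joined[index] = {
            # Entities without a description still get a usable label.
            "descr": descriptions.get(index, f"entity-{index}"),
            "value": value,
            "status": sensor_status.get(index),
        }
    return joined


# Hypothetical sample data for illustration only.
descriptions = {1001: "Inlet temperature sensor", 1002: "Fan tray 1"}
values = {1001: 27, 1003: 41}
status = {1001: 1, 1003: 1}

inventory = join_sensors(descriptions, values, status)
```

Keying everything on the entity index means a vendor FRU table can be merged into the same structure later without changing the labeling logic.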

Temperature monitoring

Temperature is the primary leading indicator. A sustained upward trend on an inlet or internal sensor is the earliest sign of cooling degradation, dust accumulation, or impending component failure.

What to poll

Walk entPhySensorValue across the device to get all sensor readings. Use entPhySensorType to filter for temperature sensors (the celsius type per RFC 3433). Cross-reference with entPhysicalDescr to label each sensor by its physical location (inlet, outlet, CPU, switch chip).

# Walk all sensor values (SNMPv3 recommended in production)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.99.1.1.1.4
# Walk entity descriptions for labeling
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.47.1.1.1.1.2
# Juniper-specific temperature walk
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.7
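A custom script has to turn walk output into per-sensor readings and then keep only the temperature sensors. The sketch below assumes numeric output of the form `.<oid>.<index> = INTEGER: <value>` (what net-snmp typically prints with numeric OID and numeric enum output enabled) and a single-integer row index; the helper names are hypothetical. entPhySensorValue is column 4 and entPhySensorType is column 1 of the sensor table in RFC 3433.

```python
import re

CELSIUS = 8  # celsius(8) in the RFC 3433 sensor type enumeration

# Matches numeric rows such as ".1.3.6.1.2.1.99.1.1.1.4.1001 = INTEGER: 27"
ROW_RE = re.compile(r"^\.?([\d.]+)\s*=\s*INTEGER:\s*(-?\d+)")


def parse_walk(output, base_oid):
    """Return {entity_index: value} for integer rows directly under base_oid."""
    prefix = base_oid.strip(".") + "."
    rows = {}
    for line in output.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue  # skip timeouts, errors, and non-integer rows
        oid, value = match.group(1).rstrip("."), match.group(2)
        suffix = oid[len(prefix):] if oid.startswith(prefix) else ""
        if suffix.isdigit():
            rows[int(suffix)] = int(value)
    return rows


def temperature_readings(values, types):
    """Keep only the sensors whose type is celsius."""
    return {i: v for i, v in values.items() if types.get(i) == CELSIUS}


# Hypothetical walk output for illustration only.
values_out = """\
.1.3.6.1.2.1.99.1.1.1.4.1001 = INTEGER: 27
.1.3.6.1.2.1.99.1.1.1.4.1002 = INTEGER: 8400
Timeout: No Response from 192.0.2.1
"""
types_out = """\
.1.3.6.1.2.1.99.1.1.1.1.1001 = INTEGER: 8
.1.3.6.1.2.1.99.1.1.1.1.1002 = INTEGER: 10
"""

values = parse_walk(values_out, ".1.3.6.1.2.1.99.1.1.1.4")
types = parse_walk(types_out, ".1.3.6.1.2.1.99.1.1.1.1")
temps = temperature_readings(values, types)
```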

Thresholds

Threshold values vary by platform and sensor. Use vendor-defined thresholds, not arbitrary absolute numbers. RFC 3433 does not define threshold objects; thresholds come from vendor-specific MIBs. Where available, alert above the high-critical value.

These Arista EOS defaults illustrate the range of thresholds you will encounter across a single chassis:

| Sensor type | Typical alert threshold | Typical critical threshold |
| --- | --- | --- |
| Front-panel temp | 65 °C | 75 °C |
| Fan controller temp | 75 °C | 85 °C |
| Switch chip temp | 105 °C | 115 °C |
| VRM temp | 105 °C | 110 °C |

On Arista EOS, hitting the alert threshold ramps all fans to maximum speed and logs a warning. Hitting the critical threshold shuts down the affected component immediately and the status LED flashes orange.
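Because the thresholds differ per sensor, a single absolute limit cannot classify readings correctly. The following sketch applies the example values from the table above; the dictionary keys are hypothetical labels, and treating a reading equal to the threshold as a crossing is an assumption.

```python
# (alert, critical) thresholds in degrees Celsius, from the example table.
THRESHOLDS = {
    "front-panel": (65, 75),
    "fan-controller": (75, 85),
    "switch-chip": (105, 115),
    "vrm": (105, 110),
}


def classify(sensor_kind, temp_c):
    """Return 'ok', 'alert', or 'critical' for one reading."""
    alert, critical = THRESHOLDS[sensor_kind]
    if temp_c >= critical:
        return "critical"
    if temp_c >= alert:
        return "alert"
    return "ok"
```

With these values, 70 °C is an alert on the front panel but entirely normal for the switch chip, which is why per-sensor thresholds matter.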

Inlet vs outlet

Inlet temperature rising across multiple devices in the same rack or row points to a datacenter HVAC issue, not a device fault. Outlet temperature rising on one device while neighbors stay stable points to a device-specific cooling failure (failed fan, dust buildup, degraded thermal interface). Always track sensor location to distinguish these.
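One way to encode this distinction is to compare the temperature change on every device in a rack over the same window. The sketch below is illustrative only: the function name, the 3 °C rise, and the 50% quorum are assumptions, not vendor guidance.

```python
def classify_rise(deltas_c, rise_c=3.0, quorum=0.5):
    """Classify a temperature rise across the devices in one rack.

    deltas_c: {device: temperature change in C over the comparison window}
    Returns a (label, rising_devices) tuple.
    """
    rising = sorted(dev for dev, delta in deltas_c.items() if delta >= rise_c)
    if not rising:
        return "stable", rising
    # Many devices rising together points at the room, not one device.
    if len(rising) / len(deltas_c) >= quorum:
        return "rack-wide", rising
    return "device-specific", rising


# Hypothetical deltas for illustration only.
hvac_case = classify_rise({"sw1": 4.2, "sw2": 3.8, "sw3": 0.4})
single_case = classify_rise({"sw1": 5.0, "sw2": 0.2, "sw3": 0.1})
```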

Fan monitoring

Fan failure removes cooling capacity and will cause temperature to rise on every downstream sensor in the device. It is the most common precursor to thermal shutdown.

What to poll

Fan state comes primarily from vendor FRU MIBs rather than ENTITY-SENSOR-MIB. On modern Cisco platforms, use CISCO-ENTITY-FRU-CONTROL-MIB. The ENTITY-SENSOR-MIB operational status field covers some fan entities but is not universally populated, so treat it as a supplement rather than the primary source.

Actionability rules:

  • Fan state not running is always actionable.
  • Fan failed on a redundant fan tray: TICKET (degraded cooling, not an emergency).
  • Fan failed on non-redundant cooling: PAGE (thermal failure imminent).
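These rules reduce to a small decision function. The sketch assumes the vendor state has already been normalized to a string; the labels "up", "down", "warning", and "unknown" are hypothetical, and treating every non-running state the same way is a simplification.

```python
def fan_action(state, redundant):
    """Map a normalized fan state to an action.

    state: "up" means running; any other value is treated as not running.
    redundant: True when the device can cool itself without this fan.
    """
    if state == "up":
        return "OK"
    # Any non-running state is actionable; redundancy decides the urgency.
    return "TICKET" if redundant else "PAGE"
```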

RPM trend

Where fan RPM is exposed (via entPhySensorValue for fan-type sensors), track the trend. A fan whose RPM is steadily declining over days or weeks is failing mechanically even before it reports a fault state. This is one of the few cases where the numeric sensor value matters more than the operational status.
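A simple way to quantify the decline is a least-squares slope over recent samples. This is a minimal sketch, assuming one averaged RPM sample per day; the function names and the 20 RPM/day cutoff are hypothetical and would need tuning per fan model.

```python
def rpm_slope_per_day(samples):
    """Least-squares slope in RPM per day for a list of (day, rpm) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def is_degrading(samples, max_drop_per_day=20.0):
    """True when RPM falls faster than the allowed rate."""
    return rpm_slope_per_day(samples) <= -max_drop_per_day


# Hypothetical daily samples for illustration only.
declining = [(0, 9000), (1, 8950), (2, 8900), (3, 8850)]
steady = [(0, 9000), (1, 9010), (2, 8995), (3, 9005)]
```

Fitting a slope rather than comparing the first and last sample keeps a single noisy reading from triggering or masking the alert.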

PSU monitoring

PSU state tells you about redundancy and power delivery health. A PSU transitioning from online to a fault state means the device has lost redundancy or is running on a single supply.

What to poll

On Cisco modern platforms, PSU state comes from CISCO-ENTITY-FRU-CONTROL-MIB. Actionability rules:

  • Any cefcFRUPowerOperStatus value other than on (2) is actionable.
  • PSU offline on a non-redundant device: PAGE (power failure imminent).
  • PSU offline on a redundant system: TICKET (redundancy lost, not an emergency).
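Applied at the device level, the rules look like the sketch below. It assumes the per-PSU cefcFRUPowerOperStatus values have already been collected into a dictionary and that the redundancy flag comes from your inventory; the function name and sample values are hypothetical.

```python
POWER_ON = 2  # on(2) in cefcFRUPowerOperStatus


def psu_action(statuses, redundant):
    """Return (action, offline_psus) for one device.

    statuses: {psu_name: cefcFRUPowerOperStatus value}
    redundant: True when the device can run with one supply offline.
    """
    offline = sorted(name for name, s in statuses.items() if s != POWER_ON)
    if not offline:
        return "OK", offline
    return ("TICKET" if redundant else "PAGE"), offline


# Hypothetical poll results; 8 stands in for any value other than on(2).
dual = psu_action({"PSU1": 2, "PSU2": 8}, redundant=True)
single = psu_action({"PSU1": 8}, redundant=False)
```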

AC loss is not always a failure

On Arista EOS, show system environment all can show a PSU with AC loss (input current = 0) but an OK output status. This is expected behavior for redundant supplies drawing from different power sources. Do not auto-escalate AC loss as a critical event unless the PSU count and redundancy policy warrant it. Check the output status, not just the input state.
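The escalation logic can be sketched by counting supplies with a healthy output and treating input loss alone as informational. The field names `input_ok` and `output_ok`, the function name, and the `required_supplies` parameter are hypothetical; how you derive them from device output is platform-specific.

```python
def psu_escalation(psus, required_supplies=1):
    """Decide severity from PSU output health, not input state alone.

    psus: list of dicts with 'name', 'input_ok', and 'output_ok' keys.
    required_supplies: supplies with healthy output needed to carry the load.
    """
    healthy = [p for p in psus if p["output_ok"]]
    ac_loss_only = [p["name"] for p in psus
                    if not p["input_ok"] and p["output_ok"]]
    if len(healthy) < required_supplies:
        return "PAGE"    # not enough working output to power the device
    if len(healthy) < len(psus):
        return "TICKET"  # device is powered but redundancy is reduced
    return "INFO" if ac_loss_only else "OK"


# Hypothetical state: PSU2 reports AC loss but its output is still OK.
psus = [
    {"name": "PSU1", "input_ok": True, "output_ok": True},
    {"name": "PSU2", "input_ok": False, "output_ok": True},
]
severity = psu_escalation(psus)
```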

Vendor-specific behavior and gotchas

CISCO-ENVMON-MIB is deprecated on modern platforms

The classic CISCO-ENVMON-MIB OIDs under .1.3.6.1.4.1.9.9.13.* are not supported on Catalyst 8500, NCS 5500, and other modern Cisco routing platforms. Cisco moved environmental monitoring to ENTITY-SENSOR-MIB (for numeric readings) and CISCO-ENTITY-FRU-CONTROL-MIB (for PSU and fan operational state).

On classic IOS and IOS-XE platforms (ISR, Catalyst 2960/3560/3750), and on IOS-XR with caveats (ASR 9000), the legacy OIDs still work:

  • Temperature: .1.3.6.1.4.1.9.9.13.1.3.1.3
  • PSU status: .1.3.6.1.4.1.9.9.13.1.5.1.3
  • Fan status: .1.3.6.1.4.1.9.9.13.1.4.1.3

Enable EnvMon traps explicitly with snmp-server enable traps envmon. Traps are not forwarded by default.

False-positive temperature alerts on ASR 9000 and 7600

Persistent false-positive hardware health warnings have been reported on ASR 9001 and 7606 routers when polled via both CISCO-ENVMON-MIB and CISCO-ENTITY-SENSOR-MIB simultaneously. The workaround is to align polling to whichever MIB the device firmware prefers and disable the other poller. Updating the MIB database on your monitoring platform also resolves many false positives.

Catalyst 3560 V1 vs V2 OID divergence

The fan status OID differs between Catalyst 3560 V1 and V2 hardware despite identical model numbers:

  • V1: .1.3.6.1.4.1.9.9.1.222.1.1.3
  • V2: .1.3.6.1.4.1.9.9.13.1.4.1.3

Instrument both OIDs or fingerprint the hardware version before assigning the monitoring template.
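Fingerprinting can be as simple as probing each candidate OID and keeping the first one that returns rows. In this sketch, `walk` is a hypothetical callable that wraps whatever SNMP client you use and returns an empty dict for an unsupported OID; `fake_walk` only simulates a V2 unit.

```python
# Candidate fan status OIDs for the two hardware versions listed above.
FAN_STATUS_OIDS = {
    "v1": ".1.3.6.1.4.1.9.9.1.222.1.1.3",
    "v2": ".1.3.6.1.4.1.9.9.13.1.4.1.3",
}


def pick_fan_oid(walk):
    """Return (variant, oid, rows) for the first OID that answers, else None."""
    for variant, oid in FAN_STATUS_OIDS.items():
        rows = walk(oid)
        if rows:
            return variant, oid, rows
    return None


def fake_walk(oid):
    """Stand-in for a real SNMP walk: only the V2 OID returns data."""
    return {1: 1} if oid == FAN_STATUS_OIDS["v2"] else {}


detected = pick_fan_oid(fake_walk)
```

Caching the detected variant per device avoids probing both OIDs on every polling cycle.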

Sensor scaling

Sensor values in ENTITY-SENSOR-MIB are scaled integers. Always read entPhySensorScale and entPhySensorPrecision to convert the raw integer to the actual unit. Many monitoring tools handle this automatically. Custom scripts must apply the multiplier explicitly or they will report values that are off by orders of magnitude.
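The conversion can be sketched as follows, based on the RFC 3433 definitions: the scale enumeration steps by factors of 1000 with units at value 9, and a positive precision gives the number of decimal places in the raw integer. Negative precision (which describes accurate digits rather than decimal places) is treated here as no decimal shift, which is an assumption of this sketch; the function name is hypothetical.

```python
from decimal import Decimal


def scaled_value(raw, scale, precision):
    """Convert a raw entPhySensorValue integer to the sensor's base unit.

    scale: entPhySensorScale enum value (8 = milli, 9 = units, 10 = kilo).
    precision: entPhySensorPrecision value.
    """
    exponent = 3 * (scale - 9)  # power of ten implied by the scale
    decimals = precision if precision > 0 else 0
    # Decimal avoids binary floating-point rounding in the shift.
    return Decimal(raw).scaleb(exponent - decimals)


# A raw value of 425 with units scale and precision 1 is 42.5.
inlet_c = scaled_value(425, scale=9, precision=1)
```

Skipping this step is what produces readings such as 425 °C or 42500 °C in custom dashboards.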

Juniper ambient temperature configuration

Juniper Junos allows configuring the expected ambient temperature:

set chassis ambient-temperature (25C | 40C | 55C)

This setting adjusts the thermal thresholds the device uses internally. Ensure your monitoring thresholds match what the device considers normal for the configured ambient.

Palo Alto Networks

ENTITY-SENSOR-MIB is the recommended MIB for PAN-OS 10.1+. Sensor rows share the ENTITY-MIB entity index, so entPhysicalDescr supplies the label for each sensor's operational status.

Trap-based alerting caveat

Some Cisco SD-WAN platforms (formerly Viptela) and certain management center platforms stopped passing the human-readable entity name in CISCO-ENTITY-MIB traps. The trap carries only the numeric entity index, which complicates alert triage. Confirm whether your platform passes descriptive strings before relying on trap-based alerting for environmental events.

Correlation patterns

Two correlation axes distinguish useful alerts from noise.

One bad sensor vs device-wide issue. When a single temperature sensor spikes but others on the same device remain stable, the problem is likely that sensor’s local cooling (a nearby fan failed, a heatsink degraded). When multiple sensors across the same device rise together, the device is experiencing a systemic thermal event (HVAC failure, multiple fan failures, or ambient temperature exceeding design limits).

Sensor state vs syslog/trap confirmation. Environmental polling should be correlated with hardware-failure syslog messages and SNMP traps. A temperature reading above threshold that is also accompanied by a syslog hardware alarm is a confirmed event. A temperature reading above threshold with no corroborating syslog may be a sensor calibration issue or a polling artifact (see the ASR 9000 false-positive pattern above).

# Check device-side environment state (read-only, safe)
ssh <device> 'show environment all'
# Juniper: check chassis environment state (temperatures, fans, PSUs)
ssh <device> 'show chassis environment'
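The polling-versus-syslog confirmation described above can be sketched as a time-window match. This assumes both sources have been reduced to epoch-second timestamps for a single device; the function name and the 300-second window are hypothetical.

```python
def confirm_events(threshold_hits, syslog_alarms, window_s=300):
    """Label each threshold crossing as confirmed or unconfirmed.

    threshold_hits: list of (epoch_seconds, sensor_name) from polling.
    syslog_alarms: list of epoch_seconds for hardware alarm messages.
    """
    results = []
    for ts, sensor in threshold_hits:
        corroborated = any(abs(ts - alarm) <= window_s for alarm in syslog_alarms)
        results.append((sensor, "confirmed" if corroborated else "unconfirmed"))
    return results


# Hypothetical timestamps for illustration only.
hits = [(1000, "inlet"), (5000, "cpu")]
alarms = [1120]
labels = confirm_events(hits, alarms)
```

Unconfirmed crossings are the ones worth checking for scaling errors or duplicate pollers before anyone is paged.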

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Temperature value per sensor | Leading indicator of thermal failure | Sustained upward trend, or any reading above vendor critical threshold |
| Temperature operational status | Binary sensor health state | Transition from ok to non-ok state |
| Inlet temperature across rack or row | Datacenter HVAC health | Multiple devices showing simultaneous inlet rise |
| Outlet temperature per device | Device-specific cooling health | One device rising while neighbors stay stable |
| Fan operational state | Cooling capacity | Fan state not running |
| Fan RPM trend | Mechanical wear prediction | RPM declining over days or weeks before fault state |
| PSU operational state | Power redundancy | PSU state not online |
| PSU input status | Input power health | AC loss with OK output is expected on redundant supplies |
| SNMP trap: environmental threshold exceeded | Push notification of threshold crossing | Reception depends on explicit trap enablement |
| Syslog: hardware alarm messages | Device-asserted event | Correlate with sensor polling for confirmation |

How Netdata helps

  • Netdata’s SNMP collector polls ENTITY-SENSOR-MIB entPhySensorValue and entPhySensorOperStatus across all physical entities, joining temperature readings with entity descriptions from ENTITY-MIB for labeled, per-sensor charts.
  • Temperature trends are stored at per-second resolution, making gradual drift visible long before a threshold is crossed. Fan RPM decline over days or weeks is detectable in the trend view without manual threshold tuning.
  • When a PSU transitions to a fault state, Netdata correlates the environmental signal with device-level syslog and SNMP trap events in the same timeline, so you can see whether the PSU failure was preceded by a power event, a temperature spike, or a fan failure.
  • Per-sensor anomaly detection flags deviations from the learned baseline for each specific sensor, accounting for the fact that normal operating temperature differs between a switch chip and a fan controller.
  • Alert templates can differentiate between redundant and non-redundant configurations: PSU failure on a single-supply device triggers a PAGE, while the same failure on a dual-supply device triggers a TICKET.