University of Calgary: Netdata in Academia

Enhancing Performance and Reducing Downtime in Academia

About The Calgary Machine Learning Lab

  • A leading research facility at the University of Calgary
  • Specializes in machine learning and artificial intelligence
  • Utilizes a fleet of high-performance servers and workstations for graduate research

Industry: Education

Story Snapshot

  • Key improvements: Enhanced server monitoring, Reduced downtime
  • Main features used: GPU usage dashboards, Critical issue alerts, Downtime alerts
  • Impact: Significantly reduced downtime from days to hours, Improved productivity and resource utilization

Empowering Academic Research with Real-Time Monitoring

At the University of Calgary’s Machine Learning Lab, a robust infrastructure of servers and workstations is critical to advancing research in machine learning (ML). Assistant Professor Yani Ioannou oversees this infrastructure, ensuring that these essential resources remain operational and are used to their fullest potential. The lab’s challenge lies not only in maintaining these resources but also in minimizing downtime to keep research moving forward.

“The vast number of servers and workstations under our purview makes it impractical to manually check their status on a daily basis. Furthermore, understanding the utilization and environmental factors affecting our GPU servers is crucial for the efficient operation of our research lab,”

Yani Ioannou, Assistant Professor

University of Calgary, Calgary Machine Learning Lab

This need is further amplified by the requirement to efficiently allocate resources for the students' research projects, enabling them to plan and execute their experiments effectively.

Leveraging Netdata for Enhanced Operational Efficiency

To address these challenges, the University of Calgary’s Machine Learning Lab adopted Netdata, a comprehensive monitoring solution. Netdata’s platform provides automated monitoring of critical issues, delivering alerts for faults that might otherwise go unnoticed. Additionally, its detailed dashboards for GPU usage allow for better planning and allocation of computational resources for both students and researchers.

“Netdata automates the monitoring of critical issues, providing alerts for faults that might not even be obvious upon manual inspection. Furthermore, it offers a mechanism for students to understand the utilization of our servers and resources, enabling them to better plan their experiments,”

Yani Ioannou, Assistant Professor

University of Calgary, Calgary Machine Learning Lab

With Netdata, the lab monitors a range of critical metrics across its infrastructure, including server temperature, CPU and memory utilization, GPU utilization and temperature, and system uptime. The most impactful features for the lab are the GPU usage summary dashboard, downtime alerts, and critical-issue alerts. These capabilities have led to substantial improvements across several areas:

  • Productivity: Researchers and students can focus more on innovation rather than troubleshooting.
  • Performance Optimization: Resources are utilized efficiently for machine learning computations.
  • Downtime Reduction: Quick alerts shorten the time needed to respond to and rectify issues, minimizing disruptions.
  • Troubleshooting Ease: Problems are simpler and faster to identify and solve.

A recent example highlighted the value of Netdata’s monitoring capabilities:

“Recently, Netdata alerted us to a persistent PCIe error in a server that appeared to operate normally. This early detection led us to discover a failed GPU fan, preventing significant downtime and potential equipment damage,”

Yani Ioannou, Assistant Professor

University of Calgary, Calgary Machine Learning Lab

The Calgary Machine Learning Lab’s experience with Netdata is a testament to the crucial role of real-time monitoring and alerts in maintaining the high-performance computing environments that academic research depends on. By leveraging Netdata, the lab not only minimized downtime but also significantly enhanced the efficiency and productivity of its research operations, establishing a new benchmark for operational excellence in the academic domain.

Discover how Netdata can elevate your research lab or organization.
