A Platform Snapshot

Organization: Sandia National Laboratories
Year: 2015

On distributed, massively parallel, high-performance computing (HPC) platforms, competition for shared network and file system resources among concurrently running applications causes significant performance degradation. Before Sandia National Laboratories’ Lightweight Distributed Metric Service (LDMS), no HPC monitoring tool provided the continuous, system-wide platform awareness that system administrators, application developers and users need to understand and troubleshoot application resource contention, network congestion, I/O bottlenecks and associated causes of computer delays. LDMS v2.2 is monitoring software that provides continuous, high-fidelity snapshots of system status across an entire HPC platform. These snapshots offer insight into how platform resources are being utilized, stressed or depleted by the aggregate workload.

Sandia National Laboratories, www.sandia.gov