In this investigation, we study how application performance is affected when jobs are permitted to share compute nodes. A series of application kernels comprising a diverse set of benchmark calculations was run in both exclusive and node-sharing modes on the Center for Computational Research’s high-performance computing (HPC) cluster. Very little increase in runtime was observed due to contention among application kernel jobs run on shared nodes. The small differences in runtime were quantitatively modeled in order to characterize the resource contention and to determine the circumstances under which it would or would not be important. A machine learning regression model applied to the runtime data successfully fitted the small differences between the exclusive and shared-node runtime data; it also provided insight into the contention for node resources that occurs when jobs are allowed to share nodes. Analysis of a representative job mix shows that the runtime of shared jobs is affected primarily by the memory subsystem, in particular by the reduction in effective cache size due to sharing, which leads to higher DRAM utilization. Insights such as these are crucial when formulating policies that propose node sharing as a mechanism for improving HPC utilization.
XALT is a tracking tool that collects accurate, detailed, and continuous job-level and link-time data. XALT stores these data in a database and ensures that all data collection is transparent to users. XALT tracks the libraries and object files linked by an application. A recent feature improvement allows XALT to also track the external subroutines and functions called by an application. This paper describes this function-tracking implementation in XALT and showcases the kinds of data and analysis that become available with this new feature. A recently developed web-based interface to the XALT database is also described, which allows center staff to more easily understand software usage on their compute resources.
XALT is a job-monitoring tool that collects accurate, detailed, and continuous job-level and link-time data on all MPI jobs running on a computing cluster. Owing to its usefulness and its complementarity to existing logs and databases, XALT has been deployed on Stampede at the Texas Advanced Computing Center and on other high-performance computing resources around the world. The data collected by XALT can be extremely valuable in helping resource providers understand resource usage and identify patterns and insights for future improvements. However, the volume of data collected by XALT grows quickly over time on large systems and presents challenges for access and analysis. In this paper, we describe the development of a prototype tool to analyze and visualize XALT data. The application utilizes Spark to process the large volume of log data and Shiny to visualize the results over the web. The application provides an easy-to-use interface that lets users conveniently share and communicate executable usage and patterns without prior knowledge of big data technology.