XSEDE16 has ended

Accelerating Discovery
Thursday, July 21

10:30am EDT

AD: Towards a Methodology for Cross-Accelerator Performance Profiling
The computing requirements of scientific applications have influenced processor design and have motivated the introduction and use of accelerator architectures for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to comprise multiple processing elements, including accelerators. Although execution time can be used to compare the performance of different processing elements, there is no standard way to analyze application performance across processing elements with very different architectural designs and, thus, to understand why one outperforms another. Without this knowledge, a developer is hampered when attempting to tune application performance effectively, as is a hardware designer when trying to understand how best to improve the design of processing elements. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three different accelerators: the Intel Xeon Phi and the NVIDIA Kepler and Fermi GPUs. Our study shows that LULESH 1.0 exhibits similar runtime behavior across the three accelerators, but runs up to 7x faster on the Kepler. Despite the significant architectural differences between the Xeon Phi and the GPUs, and the differences in the metrics used to characterize the performance of these architectures, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi. To do this, we compared their achieved instructions per cycle and vectorization efficiency, as well as their memory behavior and their power and energy consumption.

Thursday July 21, 2016 10:30am - 11:00am EDT
Sevilla InterContinental Miami

11:00am EDT

AD: Estimating the Accuracy of User Surveys for Assessing the Impact of HPC Systems
Each year, the Computational & Information Systems Laboratory (CISL) surveys its current and recent user community to gather metrics about the scientific impact and outcomes from the use of CISL’s high-performance computing systems, particularly peer-reviewed publications. However, with a modest response rate and reliance on self-reporting by users, the accuracy of the survey is uncertain, as is the degree of that uncertainty. To quantify this uncertainty, CISL undertook a project that attempted to provide statistically supported limits on the accuracy and precision of the survey approach. In our modeling phase, we discovered limitations related to the range of users’ HPC usage, and we tried several methods of adjusting the model to fit the usage data. The resulting statistical models leverage data about the HPC usage associated with survey invitees to quantify the degree to which the survey undercounts the relevant publications. A qualitative assessment of the collected publications aligns with the statistical models, reiterates the challenges associated with acknowledgment for use of HPC resources, and suggests ways to improve the survey results further.

David Hart

User Services Section Manager, National Center for Atmospheric Research

Thursday July 21, 2016 11:00am - 11:30am EDT
Sevilla InterContinental Miami

11:30am EDT

AD: Minimization of Xeon Phi Core Use with Negligible Execution Time Impact
For many years GPUs have been components of HPC clusters (Titan and Piz Daint), while only in recent years has the Intel Xeon Phi been included (Tianhe-2 and Stampede). For example, GPUs are in 14% of systems in the November 2015 Top500 list, while the Xeon Phi is in 6%. Intel introduced the Xeon Phi to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive and easier-to-utilize processing power with less power consumption. The best Xeon Phi execution-time performance requires high data parallelism, good scalability, and the use of parallel algorithms. In addition, efficient power performance and application concurrency can be achieved by decreasing the number of cores employed for application execution. Accordingly, the objectives of this paper are to: (1) Demonstrate that some applications can be executed with fewer cores than are available to the user with a negligible impact on execution time: For 59.3% of the 27 application instances studied, doing this resulted in better performance, and for 37%, using fewer than half of the cores degraded performance by no more than 10% in the worst case. (2) Develop a tool that provides the user with the "best" number of cores to employ: We designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that provides the user with an estimate of this number for a given application. (3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with a negligible impact on execution time: We identified, via statistical analyses, the following three metrics that are indicative of this, at least for the application instances studied: a low L1 Compute to Data Access ratio (i.e., the average number of computations performed per byte of data loaded/stored in the L1 cache), a high use of data bandwidth, and, to a lesser extent, a low vectorization intensity.

Thursday July 21, 2016 11:30am - 12:00pm EDT
Sevilla InterContinental Miami
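
To make the two quantities in the Xeon Phi abstract above concrete, here is a minimal sketch. It is not the algorithm from the paper or from its Periscope Tuning Framework plugin, neither of which is described in this listing. The function names, the sample timings, and the choice of the largest measured core count as the baseline are all assumptions made for illustration; only the 10% threshold and the definition of the L1 Compute to Data Access ratio come from the abstract.

```python
def fewest_cores_within_tolerance(timings, tolerance=0.10):
    """Pick the smallest core count with a "negligible" slowdown.

    timings maps a core count to a measured execution time in seconds.
    The baseline is the run with the most cores (an assumption), and a
    run qualifies if it is at most `tolerance` slower than that baseline.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    baseline = timings[max(timings)]
    limit = baseline * (1.0 + tolerance)
    return min(cores for cores, seconds in timings.items() if seconds <= limit)


def l1_compute_to_data_access_ratio(computations, l1_bytes_accessed):
    """Average number of computations per byte loaded/stored in the L1 cache."""
    if l1_bytes_accessed <= 0:
        raise ValueError("l1_bytes_accessed must be positive")
    return computations / l1_bytes_accessed


if __name__ == "__main__":
    # Hypothetical measurements, not data from the paper.
    sample_timings = {15: 12.9, 30: 10.4, 60: 10.1}
    print(fewest_cores_within_tolerance(sample_timings))
    print(l1_compute_to_data_access_ratio(2_000_000, 1_000_000))
```

With these made-up timings, the 30-core run is about 3% slower than the 60-core run and so falls inside the 10% tolerance, while the 15-core run does not. A low value from the second function would correspond to the first of the three indicators named in the abstract.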