Thursday, July 21, 2016 • 11:30am - 12:00pm EDT
AD: Minimization of Xeon Phi Core Use with Negligible Execution Time Impact


For many years GPUs have been components of HPC clusters (e.g., Titan and Piz Daint), while only in recent years has the Intel Xeon Phi been included (e.g., Tianhe-2 and Stampede). GPUs appear in 14% of the systems on the November 2015 Top500 list, the Xeon Phi in 6%. Intel introduced the Xeon Phi to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive, easier-to-use processing power with lower power consumption. The best Xeon Phi execution-time performance requires high data parallelism, good scalability, and the use of parallel algorithms. In addition, better power efficiency and application concurrency can be achieved by decreasing the number of cores employed for application execution. Accordingly, the objectives of this paper are to:

(1) Demonstrate that some applications can be executed with fewer cores than are available to the user, with negligible impact on execution time: for 59.3% of the 27 application instances studied, doing so improved performance, and for 37% using fewer than half of the cores degraded performance by no more than 10% in the worst case.

(2) Develop a tool that provides the user with the "best" number of cores to employ: we designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that for a given application provides the user with an estimate of this number.

(3) Determine whether performance metrics can be used to identify applications that can be executed with fewer cores with negligible impact on execution time: via statistical analyses we identified the following three metrics that are indicative of this, at least for the application instances studied: a low L1 compute-to-data-access ratio (the average number of computations performed per byte of data loaded/stored in the L1 cache), high use of data bandwidth, and, to a lesser extent, low vectorization intensity.
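The core-selection idea behind objective (2) can be sketched in a few lines: measure execution time at several core counts, then pick the smallest count whose time stays within a tolerance of the best observed time. This is a minimal illustration only, not the Periscope plugin's actual algorithm; the `minimal_cores` function, the 10% tolerance, and the timing values are hypothetical.

```python
def minimal_cores(times_by_cores, tolerance=0.10):
    """Return the smallest core count whose measured execution time is
    within `tolerance` (as a fraction) of the best observed time.

    times_by_cores: dict mapping core count -> execution time in seconds.
    """
    best = min(times_by_cores.values())
    # Core counts whose runtime is "negligibly" worse than the best.
    eligible = [c for c, t in times_by_cores.items()
                if t <= best * (1.0 + tolerance)]
    return min(eligible)

# Hypothetical timings for one kernel on a 61-core Xeon Phi:
timings = {61: 10.2, 32: 10.5, 16: 10.9, 8: 14.0}
print(minimal_cores(timings))  # -> 16 (10.9 s is within 10% of 10.2 s)
```

In practice the paper derives such a recommendation automatically from performance measurements rather than from an exhaustive sweep, but the acceptance criterion (a bounded slowdown relative to the best configuration) is the same idea.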

Sevilla InterContinental Miami
