XSEDE16 has ended

Accelerating Discovery
Tuesday, July 19
 

10:30am EDT

AD: Image Analysis and Infrastructure Support for Data Mining the Farm Security Administration – Office of War Information Photography Collection
This paper reports on the initial work and future trajectory of the Image Analysis of the Farm Security Administration – Office of War Information Photography Collection team, supported through an XSEDE startup grant and Extended Collaborative Support Service (ECSS). The team is developing algorithms and utilizing existing ones, running them on Comet to analyze the Farm Security Administration – Office of War Information image corpus from 1935-1944, held by the Library of Congress (LOC) and accessible online to the public. The project serves many fields within the humanities, including photography, art, visual rhetoric, linguistics, American history, anthropology, and geography, as well as the general public. Through robust image, metadata, and lexical semantics analysis, researchers will gain deeper insight into photographic techniques and aesthetics employed by FSA photographers, editorial decisions, and overall collection content. Pairing image analysis with metadata analysis, including lexicosemantic extraction, expands the opportunities for deep data mining of this collection even further.


Tuesday July 19, 2016 10:30am - 11:00am EDT
Chopin Ballroom

11:00am EDT

AD/VIS: UrbanFlow: Large-scale Framework to Integrate Social Media and Authoritative Landuse Maps
Users of micro-blogging services and content sharing platforms generate massive amounts of geotagged information on a daily basis. Although these big data streams are not intended as a source of geospatial information, researchers have found that ambient geographic information (AGI) complements authoritative sources. In this regard, the digital footprints of users provide real-time monitoring of people's activities and their spatial interactions, while more traditional sources such as remote sensing and land use maps provide a synoptic view of the physical infrastructure of the urban environment. Traditionally trained scientists in social science and geography usually face great challenges when experimenting with new methods to synthesize big data sources because of the data volume and its lack of structure. To overcome these challenges, we developed UrbanFlow, a platform that allows scientists to synthesize massive geolocated Twitter data with detailed land use maps. This platform allows scientists to gather observations to better understand human mobility patterns in relation to urban land use, study cities’ spatial networks by identifying common frequent visitors between different urban neighborhoods, and monitor the patterns of urban land use change. A key aspect of UrbanFlow is utilizing the power of distributed computing (using Apache Hadoop and cloud-based services) to process massive numbers of tweets and integrate them with authoritative datasets, as well as to store them efficiently in a database cluster to facilitate fast interaction with users.


Tuesday July 19, 2016 11:00am - 11:30am EDT
Chopin Ballroom

11:30am EDT

AD: Accelerating TauDEM for Extracting Hydrology Information from National-Scale High Resolution Topographic Dataset
With the advent of DEMs with finer resolution and higher accuracy to represent surface elevation, there is a pressing need for optimized parallel hydrology algorithms that can process big DEM data efficiently. TauDEM (Terrain Analysis Using Digital Elevation Models) is a suite of Digital Elevation Model (DEM) tools for the extraction and analysis of hydrologic information from topography. We present performance improvements to the parallel hydrology algorithms in the TauDEM suite that allowed us to process very big DEM data. The parallel algorithms are improved by applying a block-wise data decomposition technique, improving the communication model, and enhancing parallel I/O to obtain maximum performance from the computational and storage resources available on supercomputer systems. After the improvements, as a case study, we successfully filled the depressions of the entire US 10-meter DEM dataset (667 GB, 180 billion raster cells) within 2 hours, a significant improvement over the previous parallel algorithm, which was unable to complete the same task within 2 days using 4,096 processor cores on the Stampede supercomputer. We report the methodology and analyze the performance of the algorithm improvements.


Tuesday July 19, 2016 11:30am - 12:00pm EDT
Chopin Ballroom

3:30pm EDT

AD: Optimization of non-linear image registration in AFNI
Analysis of Functional NeuroImages (AFNI) is a widely adopted software package in the fMRI data analysis community. For many types of analysis pipelines, one important step is to register a subject's image to a pre-defined template so that different images can be compared within a normalized coordinate system. Although a 12-parameter affine transformation works fine for some standard cases, it is usually found insufficient for voxelwise types of analyses. Registration is especially challenging if the subject has brain atrophy due to a neurological condition such as Parkinson's disease. The 3dQwarp code in AFNI is a non-linear image registration procedure that overcomes the drawbacks of a canonical affine transformation. However, the existing OpenMP instrumentation in 3dQwarp is not efficient for warping at an ultra-fine level, and the constant trip count of the iterative algorithm also hurts accuracy. Based on profiling and benchmark analysis, we improve the parallel efficiency by optimizing the OpenMP structure and obtain about a 2x speedup for a normalized workload. By incorporating a convergence criterion, we are able to perform warping at a much finer level beyond the default threshold and achieve about a 20% improvement in terms of Pearson correlation.
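The registration quality above is reported in terms of Pearson correlation. As a minimal sketch of that metric only (this is not AFNI or 3dQwarp code; the function name `pearson` and the use of flattened voxel intensity lists are assumptions made for illustration), the coefficient for two equal-length intensity lists could be computed as:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper: xs and ys stand in for flattened voxel
    intensities of a warped image and a template.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that the two images vary together voxel by voxel; how the paper aggregates this score across subjects is not stated in the abstract.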


Tuesday July 19, 2016 3:30pm - 4:00pm EDT
Chopin Ballroom

4:00pm EDT

AD: Optimization and parallel load balancing of the MPAS Atmosphere weather and climate code
MPAS (Model for Prediction Across Scales) Atmosphere is a highly scalable application for global weather and climate simulations. It uses an unstructured Voronoi mesh in the horizontal dimensions, thereby avoiding problems associated with traditional rectilinear grids, and deploys a subset of the atmospheric physics used in WRF. In this paper, we describe work that was done to improve the overall performance of the software: serial optimization of the dynamical core and thread-level load balancing of the atmospheric physics. While the overall reductions were modest for standard benchmarks, we expect that the contributions will become more important with the eventual addition of atmospheric chemistry or when running at larger scale.


Tuesday July 19, 2016 4:00pm - 4:30pm EDT
Chopin Ballroom

4:30pm EDT

AD: Time Propagation of Partial Differential Equations Using the Short Iterative Lanczos Method and Finite-Element Discrete Variable Representation
The short iterative Lanczos method has been combined with the finite-element discrete variable representation to yield a powerful approach to solving the time-dependent Schrödinger equation. It has been applied to the interaction of short, intense laser radiation (attosecond pulses) with atoms and molecules to describe their single and double ionization, but the approach is not limited to this particular application. The algorithm will be described in some detail, along with how it has been successfully ported to the Intel Xeon Phi coprocessors. While further experimentation is needed, the current results provide reasonable evidence that by suitably modifying the code to combine MPI, OpenMP, and compiler offload directives, one can achieve significant improvement in performance from these coprocessors for problems such as the above.


Tuesday July 19, 2016 4:30pm - 5:00pm EDT
Chopin Ballroom
 
Wednesday, July 20
 

8:30am EDT

AD: A Parallel Evolutionary Algorithm for Subset Selection in Causal Inference Models
Science is concerned with identifying causal inferences. To move beyond simple observed relationships and associational inferences, researchers may employ randomized experimental designs to isolate a treatment effect, which then permits causal inferences. When experiments are not practical, a researcher is relegated to analyzing observational data. To make causal inferences from observational data, one must adjust the data so that they resemble data that might have emerged from an experiment. Traditionally, this has occurred through statistical models identified as matching methods. We claim that matching methods are unnecessarily constraining and propose, instead, that the goal is better achieved via a subset selection procedure that is able to identify statistically indistinguishable treatment and control groups. This reformulation to identifying optimal subsets leads to a model that is computationally complex. We develop an evolutionary algorithm that is more efficient and identifies empirically more optimal solutions than any other causal inference method. To gain greater efficiency, we also develop a scalable algorithm for a parallel computing environment by enlisting additional processors to search a greater range of the solution space and to aid other processors at particularly difficult peaks.


Wednesday July 20, 2016 8:30am - 9:00am EDT
Chopin Ballroom

9:00am EDT

AD: Three Dimensional Simulations of Fluid Flow and Heat Transfer with Spectral Element Method
This paper presents a computational approach for simulating three-dimensional fluid flow and convective heat transfer involving viscous heating and the Boussinesq approximation for the buoyancy term. The algorithm was implemented with a modal spectral element method for accurate solution of the coupled nonlinear partial differential equations. High-order time integration schemes were used for the time derivatives. Simulation results were analyzed and verified. They indicate that this approach is viable for investigating convective heat transfer subject to complex thermal and flow boundary conditions in three-dimensional irregular domains.


Wednesday July 20, 2016 9:00am - 9:30am EDT
Chopin Ballroom

9:30am EDT

AD: Performance and Scalability Analysis for Parallel Reservoir Simulations on Three Supercomputer Architectures
In this work, we tested performance and scalability on three supercomputers of different architectures, SDSC’s Comet, SciNet’s GPC, and IBM’s Blue Gene/Q, by benchmarking parallel reservoir simulations. The Comet and GPC systems adopt a fat-tree network and are connected with InfiniBand interconnect technology. The Blue Gene/Q uses a 5-dimensional toroidal network and is connected with custom interconnects. In terms of supercomputer architecture, these systems represent two main interconnect families: fat-tree and torus. To demonstrate application scalability on supercomputers with today’s diversified architectures, we benchmark a parallel black oil simulator that is extensively used in the petroleum industry. Our implementation of this simulator is coded in C and MPI, and it offers grids, data, linear solvers, preconditioners, distributed matrices and vectors, and modeling modules. Load balancing is based on the Hilbert space-filling curve (HSFC) method. Krylov subspace and AMG solvers are implemented, including restarted GMRES, BiCGSTAB, and AMG solvers from Hypre. The results show that Comet is the fastest supercomputer among the tested systems and that the Blue Gene/Q has the best parallel efficiency. The scalability analysis helps to identify the performance barriers for different supercomputer architectures. The study provides insights for carrying out parallel reservoir simulations on large-scale computers of different architectures.
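Load balancing with a Hilbert space-filling curve can be sketched as follows. This is an illustration under stated assumptions, not the simulator's implementation (which the abstract says is written in C and MPI): it assumes a 2D square grid whose side `n` is a power of two, and the names `hilbert_index` and `partition_cells` are hypothetical. Cells are ordered along the curve and cut into contiguous, nearly equal chunks, so each rank receives a spatially compact set of cells.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n-by-n grid.

    Assumes n is a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_cells(n, num_ranks):
    """Split all grid cells into num_ranks contiguous pieces of the curve."""
    cells = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: hilbert_index(n, c[0], c[1]),
    )
    base, extra = divmod(len(cells), num_ranks)
    parts, start = [], 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        parts.append(cells[start:start + size])
        start += size
    return parts
```

Because consecutive cells on the curve are grid neighbors, contiguous chunks tend to have short boundaries, which keeps inter-rank communication low. A production code would typically weight cells by computational cost rather than count them equally.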


Wednesday July 20, 2016 9:30am - 10:00am EDT
Chopin Ballroom

10:30am EDT

AD: A Scalable High-performance Topographic Flow Direction Algorithm for Hydrological Information Analysis
Hydrological information analysis based on Digital Elevation Models (DEM) provides hydrological properties derived from high-resolution topographic data represented as an elevation grid. Flow direction detection is one of the most computationally intensive functions. As the resolution of DEMs becomes higher, the computational bottleneck of this function hinders the use of these DEM data in large-scale studies. Because computing flow directions for the study extent needs global information, the parallelization involves iterative communications. This paper presents an efficient parallel flow direction detection algorithm that identifies spatial features (e.g., flats) that can or cannot be computed locally. An efficient sequential algorithm is then applied to resolve the local features, while communication is applied to compute non-local features. This strategy significantly reduces the number of iterations needed in the parallel algorithm. Experiments show that our algorithm outperformed the best existing parallel algorithm (i.e., the d8 algorithm in TauDEM) by two orders of magnitude. The parallel algorithm exhibited desirable scalability on the Stampede and ROGER supercomputers.
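For context, the local D8 rule that flow direction algorithms build on can be sketched as below. This is a generic serial illustration, not the paper's parallel algorithm and not TauDEM's code; the function name `d8_direction` and the list-of-lists DEM are assumptions. Each cell drains toward whichever of its eight neighbors gives the steepest downward slope. A cell with no lower neighbor (part of a flat or a pit) returns `None`, which is the kind of feature the abstract says may need information beyond the immediate neighborhood.

```python
import math

# Offsets (row, col) to the eight neighbors of a cell.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]


def d8_direction(dem, r, c):
    """Return the (d_row, d_col) offset of steepest descent, or None.

    dem is a rectangular list of lists of elevations with unit cell size.
    Ties keep the first neighbor encountered in OFFSETS order.
    """
    best = None
    best_slope = 0.0
    for dr, dc in OFFSETS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(dem) and 0 <= nc < len(dem[0]):
            distance = math.hypot(dr, dc)  # 1 for edges, sqrt(2) for diagonals
            slope = (dem[r][c] - dem[nr][nc]) / distance
            if slope > best_slope:
                best_slope = slope
                best = (dr, dc)
    return best
```

Resolving the `None` cells is where the global, iterative work arises; how the paper's algorithm handles them is described only at the level of detail given in the abstract.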


Wednesday July 20, 2016 10:30am - 11:00am EDT
Chopin Ballroom

11:00am EDT

AD: Implementation of Simple XSEDE-Like Clusters: Science Enabled and Lessons Learned
The Extreme Science and Engineering Discovery Environment (XSEDE) has created a suite of software that is collectively known as the XSEDE-Compatible Basic Cluster (XCBC). It is designed to enable smaller, resource-constrained research groups or universities to quickly and easily implement a computing environment similar to XSEDE computing resources. The XCBC acts as both an enabler of local research and as a springboard for seamlessly moving researchers onto XSEDE resources when the time comes to scale up their efforts to larger hardware. The XCBC system consists of the Rocks Cluster Manager, developed at the San Diego Supercomputer Center for use on Gordon and Comet, and an XSEDE-specific “Rocks Roll”, containing a selection of libraries, compilers, and scientific software curated by the Campus Bridging (CB) group. The versions of software included in the roll are kept up to date with those implemented on XSEDE resources.
The Campus Bridging team has helped several universities implement the XCBC, and finds the design to be extremely useful for resource-limited (in time, administrator knowledge, or money) research groups or institutions. Here, we detail our recent experiences in implementing the XCBC design at university campuses across the country. These XCBC implementations were carried out with Campus Bridging staff traveling on-site to the partner institutions to directly assist with the cluster build. Results from site visits at partner institutions show how the Campus Bridging team helped accelerate cluster implementation and research by providing expertise and hands-on assistance during cluster building. We also describe how, following a visit from Campus Bridging staff, the XCBC has accelerated research and discovery at our partner institutions.


Wednesday July 20, 2016 11:00am - 11:30am EDT
Chopin Ballroom

11:30am EDT

AD: Using High Performance Computing To Model Cellular Embryogenesis
C. elegans is a primitive multicellular organism (worm) that shares many important biological characteristics that arise as complications within human beings. It begins as a single cell and then undergoes a complex embryogenesis to form a complete animal. Using experimental data, the early stages of life of the cells are simulated by computers. The goal of this project is to use this simulation to compare the embryogenesis stage of C. elegans cells with that of human cells. Since the simulation involves the manipulation of many files and large amounts of data, the power provided by supercomputers and parallel programming is required. The serial agent-based simulation program NetLogo completed the simulation in roughly six minutes. By comparison, using the parallel agent-based simulation toolkit RepastHPC, the simulation completed in under a minute when executing on four processors of a small cluster. Unlike NetLogo, RepastHPC does not contain a visual element. Therefore, a visualization program, VisIt, was used to graphically show the data produced by the RepastHPC simulation.


Wednesday July 20, 2016 11:30am - 12:00pm EDT
Chopin Ballroom

3:30pm EDT

AD: Tools for studying populations and timeseries of neuroanatomy enabled through GPU acceleration in the Computational Anatomy Gateway
The Computational Anatomy Gateway is a software-as-a-service tool for medical imaging researchers to quantify changes in anatomical structures over time and through the progression of disease. GPU acceleration on the Stampede cluster has enabled the development of new tools, combining advantages of grid-based and particle-based methods for describing fluid flows, and scaling up analysis from single scans to populations and time series. We describe algorithms for estimating average anatomies and for quantifying atrophy rate over time. We report code performance on datasets of different sizes, revealing that the number of vertices in a triangulated surface presents a bottleneck to our computation. We show results on an example dataset, quantifying atrophy in the entorhinal cortex, a medial temporal lobe brain region whose structure is sensitive to changes in early Alzheimer's disease.


Wednesday July 20, 2016 3:30pm - 4:00pm EDT
Chopin Ballroom

4:00pm EDT

AD: Delayed Update Algorithms for Quantum Monte Carlo Simulation on GPU
QMCPACK is open source scientific software designed to perform Quantum Monte Carlo simulation, a first-principles method for describing many-fermion systems. The evaluation of each Monte Carlo move requires finding the determinant of a dense matrix of wave functions. This calculation forms a key computational kernel in QMCPACK. After each accepted event, the wave function matrix undergoes a rank-one update to represent a single particle move within the system. The Sherman-Morrison formula is used to update the matrix inverse. Occasionally, the explicit inverse must be recomputed to maintain numerical stability. An alternate approach to this kernel utilizes QR factorization to maintain stability without re-factorization.

 

Algorithms based on a novel delayed update scheme are explored in this effort. This strategy involves calculating probabilities for multiple successive Monte Carlo moves and delaying their application to the matrix of wave functions until an event is denied or a predetermined limit of acceptances is reached. Updates grouped in this manner are then applied to the matrix en bloc to achieve enhanced computational intensity.

 

GPU-accelerated delayed update algorithms are tested and profiled for both Sherman-Morrison and QR based probability evaluation kernels. Results are evaluated against existing methods for numerical stability and efficiency; emphasis is placed on large systems, where acceleration is critical.
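The rank-one inverse update mentioned above can be sketched with the Sherman-Morrison formula, (A + u vᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u), where the denominator is also the determinant ratio det(A + u vᵀ) / det(A). This is a plain-Python illustration on small dense matrices, not QMCPACK code and not the delayed-update scheme itself; the function name `sherman_morrison` and the singularity tolerance are assumptions.

```python
def sherman_morrison(a_inv, u, v):
    """Update A^{-1} after the rank-one change A -> A + u v^T.

    Returns (ratio, new_inverse), where ratio = det(A + u v^T) / det(A).
    a_inv is a square list of lists; u and v are lists of the same size.
    """
    n = len(a_inv)
    # A^{-1} u (column vector) and v^T A^{-1} (row vector).
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    v_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    ratio = 1.0 + sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(ratio) < 1e-12:  # tolerance chosen for illustration only
        raise ZeroDivisionError("updated matrix is numerically singular")
    new_inv = [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / ratio for j in range(n)]
        for i in range(n)
    ]
    return ratio, new_inv
```

Each update costs O(n²) instead of the O(n³) of a fresh inversion, but rounding errors accumulate over many updates, which is consistent with the abstract's remark that the explicit inverse is occasionally recomputed.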


Wednesday July 20, 2016 4:00pm - 4:30pm EDT
Chopin Ballroom

4:30pm EDT

AD: Efficient Primitives for Standard Tensor Linear Algebra
This paper presents the design and implementation of a low-level library to compute general sums and products over multi-dimensional arrays (tensors). Using only 3 low-level functions, the API at once generalizes core BLAS1-3 and eliminates the need for most tensor transpositions. Despite their relatively low operation count, we show that these transposition steps can become performance limiting in typical use cases for BLAS on tensors. The execution of the present API achieves peak performance on the same order of magnitude (teraflops) as for vendor-optimized GEMM by utilizing a code generator to output CUDA source code for all computational kernels. The outline for these kernels is a multi-dimensional generalization of the MAGMA BLAS matrix multiplication on GPUs. Separate transposition steps can be skipped because every kernel allows arbitrary multi-dimensional transpositions of the arguments. The library, including its methodology and programming techniques, is made available in SLACK. Future improvements to the library include a high-level interface to translate directly from a LaTeX-like equation syntax to a data-parallel computation.

Wednesday July 20, 2016 4:30pm - 5:00pm EDT
Chopin Ballroom
 
Thursday, July 21
 

10:30am EDT

AD: Computational Considerations in Transcriptome Assemblies and Their Evaluation, using High Quality Human RNA-Seq Data
It is crucial to understand the performance of transcriptome assemblies to improve current practices. Investigating the factors that affect a transcriptome assembly is very important and is the primary goal of our project. To that end, we designed a multi-step pipeline consisting of a variety of pre-processing and quality control steps. XSEDE allocations enabled us to meet the computational demands of the project. The high-memory Blacklight and Greenfield systems at Pittsburgh Supercomputing Center were essential to accomplishing multiple steps of this project. This paper presents the computational aspects of our comprehensive transcriptome assembly and validation study.


Thursday July 21, 2016 10:30am - 11:00am EDT
Chopin Ballroom

10:30am EDT

AD: Towards a Methodology for Cross-Accelerator Performance Profiling
The computing requirements of scientific applications have influenced processor design, and have motivated the introduction and use of accelerator architectures for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to be comprised of multiple processing elements, including accelerators. Although execution time can be used to compare the performance of different processing elements, there exists no standard way to analyze application performance across processing elements with very different architectural designs and, thus, understand why one outperforms another. Without this knowledge, a developer is handicapped when attempting to effectively tune application performance as is a hardware designer when trying to understand how best to improve the design of processing elements. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three different accelerators: the Intel Xeon Phi and the NVIDIA Kepler and Fermi GPUs. Our study shows that LULESH 1.0 exhibits similar runtime behavior across the three accelerators, but runs up to 7x faster on the Kepler. Despite the significant architectural differences between the Xeon Phi and the GPUs, and the differences in the metrics used to characterize the performance of these architectures, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi. To do this, we compared their achieved instructions per cycle and vectorization efficiency, as well as their memory behavior and power and energy consumption.


Thursday July 21, 2016 10:30am - 11:00am EDT
Sevilla InterContinental Miami

11:00am EDT

AD: Estimating the Accuracy of User Surveys for Assessing the Impact of HPC Systems
Each year, the Computational & Information Systems Laboratory (CISL) conducts a survey of its current and recent user community to gather a number of metrics about the scientific impact and outcomes from the use of CISL’s high-performance computing systems, particularly peer-reviewed publications. However, with a modest response rate and reliance on self-reporting by users, the accuracy of the survey is uncertain as is the degree of that uncertainty. To quantify this uncertainty, CISL undertook a project that attempted to provide statistically supported limits on the accuracy and precision of the survey approach. We discovered limitations related to the range of users’ HPC usage in our modeling phase, and several methods were attempted to adjust the model to fit the usage data. The resulting statistical models leverage data about the HPC usage associated with survey invitees to quantify the degree to which the survey undercounts the relevant publications. A qualitative assessment of the collected publications aligns with the statistical models, reiterates the challenges associated with acknowledgment for use of HPC resources, and suggests ways to improve the survey results further.

Speakers
David Hart

User Services Section Manager, National Center for Atmospheric Research


Thursday July 21, 2016 11:00am - 11:30am EDT
Sevilla InterContinental Miami

11:00am EDT

AD: Improving the Scalability of a Charge Detection Mass Spectrometry Workflow
The Indiana University (IU) Department of Chemistry’s Martin F. Jarrold (MFJ) Research Group studies a specialized technique of mass spectrometry called Charge Detection Mass Spectrometry (CDMS). The goal of mass spectrometry is to determine the mass of chemical and biological compounds, and with CDMS, the MFJ Research Group is extending the upper limit of mass detection. These researchers have developed a scientific application, which accurately analyzes raw CDMS data generated from their mass spectrometer. This paper explains the comprehensive process of optimizing the group’s workflow by improving both the latency and throughput of their CDMS application. These significant performance improvements enabled high efficiency and scalability across IU’s Advanced Cyberinfrastructure; overall, this analysis and development resulted in a 25x speedup of the application.


Thursday July 21, 2016 11:00am - 11:30am EDT
Chopin Ballroom

11:30am EDT

AD: Minimization of Xeon Phi Core Use with Negligible Execution Time Impact
For many years GPUs have been components of HPC clusters (e.g., Titan and Piz Daint), while only in recent years has the Intel Xeon Phi been included (e.g., Tianhe-2 and Stampede). For example, GPUs are in 14% of systems in the November 2015 Top500 list, while the Xeon Phi is in 6%. Intel came out with Xeon Phi to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive and easier-to-utilize processing power with less power consumption. The best Xeon Phi execution-time performance requires high data parallelism, good scalability, and the use of parallel algorithms. In addition, efficient power performance and application concurrency can be achieved by decreasing the number of cores employed for application execution. Accordingly, the objectives of this paper are to: (1) Demonstrate that some applications can be executed with fewer cores than are available to the user with a negligible impact on execution time: For 59.3% of the 27 application instances studied, doing this resulted in better performance, and for 37%, using fewer than half of the cores resulted in performance degradation of no more than 10% in the worst case. (2) Develop a tool that provides the user with the "best" number of cores to employ: We designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that for a given application provides the user with an estimate of this number. (3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with a negligible impact on execution time: We identified, via statistical analyses, the following three metrics that are indicative of this, at least for the application instances studied: a low L1 Compute to Data Access ratio, i.e., the average number of computations that are performed per byte of data loaded/stored in the L1 cache; a high use of data bandwidth; and, to a lesser extent, a low vectorization intensity.


Thursday July 21, 2016 11:30am - 12:00pm EDT
Sevilla InterContinental Miami

11:30am EDT

AD: Scaling GIS analysis tasks from the desktop to the cloud utilizing contemporary distributed computing and data management approaches: A case study of project-based learning and cyberinfrastructure concepts
In this paper we present the experience of scaling a geographic information system modeling framework in parallel to hundreds of processors. The project began in an active learning cyberinfrastructure course, which was followed by an XSEDE ECSS effort in collaboration across multiple institutions.


Thursday July 21, 2016 11:30am - 12:00pm EDT
Chopin Ballroom