Jetstream is undergoing its acceptance review by the National Science Foundation (NSF) at the beginning of May. We expect the system to be accepted by the NSF in short order, and this abstract is written with the assumption that acceptance will be complete by the time final versions of papers are due. Here, we present the acceptance test criteria and results that define the key performance characteristics of Jetstream, describe the experiences of the Jetstream team in standing up and operating an OpenStack-based cloud environment, and describe some of the early scientific results that have been obtained by researchers and students using this system. Jetstream is a distributed production cloud resource and, as such, is a first-of-a-kind system for the NSF; it is scaled at an investment and computational capability – 0.5 PetaFLOPS peak – that is consistent with this status. While the computational capability does not stand out within the spectrum of resources funded by the NSF and supported by XSEDE, the functionality does. Jetstream offers interactive virtual machines (VMs) provided through the user-friendly Atmosphere interface. The software stack consisting of Globus for authentication and data transfer, OpenStack as the basic cloud environment, and Atmosphere as the user interface has proved very effective, although the OpenStack change cycle and intentional lack of backwards compatibility create certain implementation challenges. Jetstream is a multi-region deployment that operates as a single integrated system and is proving effective in supporting modes and subdisciplines of research traditionally underrepresented on larger XSEDE-supported clusters and supercomputers. Already researchers in biology, network science, economics, earth science, and computer science have used it to perform research – much of it research in the “long tail of science.”
Hardware virtualization has been gaining a significant share of computing time in recent years. Using virtual machines (VMs) for parallel computing is an attractive option for many users. A VM gives users the freedom to choose an operating system, software stack, and security policies, leaving the physical hardware, OS management, and billing to the physical cluster administrators. The well-known solutions for cloud computing, both commercial (Amazon Cloud, Google Cloud, Yahoo Cloud, etc.) and open-source (OpenStack, Eucalyptus), provide platforms for running a single VM or a group of VMs. With all the benefits, there are also some drawbacks, including reduced performance when running code inside a VM, increased complexity of cluster management, and the need to learn new tools and protocols to manage the clusters. At SDSC, we have created a novel framework and infrastructure that provides virtual HPC clusters to projects using the NSF-sponsored Comet supercomputer. Managing virtual clusters on Comet is similar to managing a bare-metal cluster in terms of the processes and tools that are employed, which is beneficial because such processes and tools are familiar to cluster administrators. Unlike platforms such as AWS, Comet's virtualization capability supports installing VMs from ISOs (i.e., a CD-ROM or DVD image) or via an isolated management VLAN (PXE). At the same time, we are helping projects take advantage of VMs by providing an enhanced client tool, Cloudmesh client, for interacting with our management system. Cloudmesh client can also be used to manage virtual machines on OpenStack, AWS, and Azure.
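A client such as Cloudmesh typically hides provider differences behind one interface. The sketch below is a hypothetical illustration of that design (all class and method names are invented for this example, not Cloudmesh's actual API):

```python
# Illustrative adapter pattern for a multi-cloud VM client.
# All names here are hypothetical, not Cloudmesh's real API.

class OpenStackBackend:
    """Provider-specific driver: would call the OpenStack APIs."""
    def boot(self, name):
        return f"openstack:{name}"

class AWSBackend:
    """Provider-specific driver: would call the AWS APIs."""
    def boot(self, name):
        return f"aws:{name}"

class CloudClient:
    """Single entry point that dispatches to the chosen provider."""
    def __init__(self):
        self.backends = {"openstack": OpenStackBackend(),
                         "aws": AWSBackend()}

    def boot(self, provider, name):
        return self.backends[provider].boot(name)

client = CloudClient()
print(client.boot("openstack", "vm1"))  # openstack:vm1
```

The benefit of this layering is that adding a new provider (e.g., Azure) only requires a new backend class; user-facing commands stay unchanged.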
Big Data problems dealing with a variety of large data sets are now common in a wide range of domain science research areas such as bioinformatics, social science, astronomical image processing, weather and climate dynamics, and economics. In some cases, the data generation and computation are done on high performance computing (HPC) resources, presenting an incentive for developing and optimizing big data middleware and tools to take advantage of the existing HPC infrastructure. Data-intensive computing middleware (such as Hadoop and Spark) can potentially benefit greatly from hardware already designed for high performance and scalability, with advanced processor technology, large memory per core, and fast storage and filesystems. SDSC's Comet represents such a resource, with a large number of compute nodes featuring fast node-local SSD storage and high-performance Lustre filesystems. This paper discusses experiences with and benefits of using optimized Remote Direct Memory Access (RDMA) Hadoop and Spark middleware on the XSEDE Comet HPC resource, including performance results for Big Data benchmarks and applications. Comet is a general purpose HPC resource, so some work is needed to integrate the middleware to run within the HPC scheduling framework. This aspect of the implementation is also discussed in detail.
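The integration step usually means standing up the middleware inside a batch allocation. As a hedged sketch (the partition name, Spark launch scripts, and application path are illustrative, not the paper's actual scripts), one might generate a Slurm batch script that derives a Spark master from the allocated node list before submitting the job:

```python
# Hypothetical sketch: generate a Slurm batch script that starts a
# Spark standalone cluster on the allocated nodes, then submits an
# application.  Partition name and app path are illustrative.

def spark_slurm_script(nodes, app):
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --partition=compute",
        "# derive the master host from the Slurm-provided node list",
        "NODELIST=$(scontrol show hostnames $SLURM_JOB_NODELIST)",
        "MASTER=$(echo $NODELIST | awk '{print $1}')",
        "# start the master, then a worker on every allocated node",
        "start-master.sh",
        "srun start-slave.sh spark://$MASTER:7077",
        f"spark-submit --master spark://$MASTER:7077 {app}",
    ])

print(spark_slurm_script(4, "wordcount.py"))
```

Tools like this let the data-intensive middleware coexist with the regular HPC workload: the cluster exists only for the lifetime of the batch job.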
Linux containers have garnered a considerable amount of interest in a very short amount of time. Docker, in particular, has captured a lot of this momentum – and is often viewed as the de facto standard for lightweight software deployment. Containerized applications can be quickly deployed, providing a virtual software environment (as opposed to a more traditional virtual machine) with little to no performance overhead. We believe Docker, and the modern trend toward containers and microservices it epitomizes, can provide a significant technological advantage for service deployment in an HPC environment. We present several use cases for containers, ranging from simple web application delivery to prototype compute cluster environments. Finally, we share our initial performance results, and our upcoming plans for container development at our site.
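The "simple web application delivery" use case boils down to a single `docker run` invocation. A minimal sketch of composing one (the image name and port mapping are hypothetical; in practice the command would be executed with `subprocess` or the Docker SDK):

```python
# Compose a docker run command for delivering a containerized web
# service.  Image name and ports are hypothetical examples.

def docker_run_cmd(image, host_port, container_port):
    return ["docker", "run",
            "-d",                                  # detached service
            "--rm",                                # clean up on exit
            "-p", f"{host_port}:{container_port}", # publish the port
            image]

cmd = docker_run_cmd("myorg/webapp:latest", 8080, 80)
print(" ".join(cmd))
```

Because the whole environment ships inside the image, the same command deploys identically on a laptop and on a cluster service node, which is the portability argument made above.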
Authentication for HPC resources has always been a double-edged issue. On the one hand, HPC facilities would like users to log in as easily as possible; on the other, with the increasing number and complexity of system exploits, HPC centers would like to protect their systems to the highest degree possible, which often leads to complicated login mechanisms. While solutions like two-factor authentication are optimal from a system administration viewpoint, user buy-in has not been as enthusiastic as one would hope. In this paper we discuss the implementation of an alternative solution, CILogon, at NICS. We start with a brief overview of CILogon and then delve into the implementation details. We describe how we incorporated CILogon to enable users to use their campus credentials to log in to the NICS user portal as well as our compute resource, Darter. We discuss the issues we faced during implementation and the strategies we implemented to overcome them.
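CILogon lets a portal delegate login to a campus identity provider via standard OAuth2/OIDC. A hedged sketch of the first step, building the authorization URL that redirects the user out to their campus credentials (the client ID and redirect URI are hypothetical; consult CILogon's client documentation for registration details):

```python
# Build an OAuth2/OIDC authorization URL for a CILogon-backed login.
# client_id and redirect_uri below are placeholders, not real values.
from urllib.parse import urlencode

def cilogon_auth_url(client_id, redirect_uri):
    params = {
        "response_type": "code",          # authorization-code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid email profile",  # identity claims requested
    }
    return "https://cilogon.org/authorize?" + urlencode(params)

print(cilogon_auth_url("myportal/abc123",
                       "https://portal.example.org/callback"))
```

After the user authenticates at their campus, the portal receives a short-lived code at the redirect URI and exchanges it for tokens, so the HPC site never handles the campus password.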
Globus offers a broad suite of research data management capabilities to the research community as web-accessible services. The initial service, launched in 2010, focused on reliable, high-performance, secure data transfer; since that time, Globus capabilities have been progressively enhanced in response to user demand. In 2015, secure data sharing and publication services were introduced. Other recent enhancements include support for secure HTTP data access, new storage system types (e.g., Amazon S3, HDFS, Ceph), endpoint search, and administrator management. A powerful new authentication and authorization platform service, Globus Auth, addresses the identity, credential, and delegation management needs encountered in research environments. New REST APIs allow external and third-party services to leverage Globus data management, authentication, and authorization capabilities as a platform, for example when building research data portals. We describe these and other recent enhancements to Globus, review adoption trends (to date, 38,000 registered users have operated on more than 150 PB and 25 billion files), and present future plans.
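A research data portal built on these REST APIs ultimately submits a JSON transfer document to the Globus Transfer service. The sketch below shows the general shape of such a document as an illustration only; the field names are abbreviated approximations, and the Globus SDK/API documentation should be consulted for the real schema:

```python
# Hedged sketch of a Globus-style transfer request payload.
# Field names are illustrative; see the Globus Transfer API docs
# for the authoritative document schema.

def transfer_document(src_endpoint, dst_endpoint, items):
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": src_endpoint,
        "destination_endpoint": dst_endpoint,
        "DATA": [
            {"DATA_TYPE": "transfer_item",
             "source_path": src, "destination_path": dst}
            for src, dst in items
        ],
    }

doc = transfer_document("campus#cluster", "lab#archive",
                        [("/data/run1.h5", "/archive/run1.h5")])
print(len(doc["DATA"]), "item(s) in transfer")
```

The point of the platform model is that the portal only constructs documents like this and lets Globus handle retries, checksums, and credentials.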
The Developing Applications with Networking Capabilities via End-to-End SDN (DANCES) project was a collaboration between the University of Tennessee’s National Institute for Computational Sciences (UT-NICS), the Pittsburgh Supercomputing Center (PSC), Pennsylvania State University (Penn State), the National Center for Supercomputing Applications (NCSA), the Texas Advanced Computing Center (TACC), the Georgia Institute of Technology (Georgia Tech), the Extreme Science and Engineering Discovery Environment (XSEDE), and Internet2 to investigate and develop the capability to add network bandwidth scheduling, via software-defined networking (SDN) programmability, to selected cyberinfrastructure services and applications. DANCES, funded by the National Science Foundation’s Campus Cyberinfrastructure – Network Infrastructure and Engineering (CC-NIE) program (award numbers 1341005, 1340953, and 1340981), evaluated three vendor network devices to determine which implemented the OpenFlow 1.3 standard with the DANCES requirements of meters and per-port queueing, the features needed to provide the network reservation and rate-limiting capability central to the project’s goals. Of the devices tested, the DANCES project determined that the Corsa DP6410 met the OpenFlow 1.3 requirements, in particular the implementation of metering and per-port queueing, which allow complex quality-of-service configuration for network flows. After selection of the network device, a test environment was set up between the University of Tennessee and PSC to simulate a supercomputer center’s compute and data transfer resource environment. This paper describes the DANCES project, the DANCES OpenFlow 1.3 specification requirements, the selection and acquisition of a suitable OpenFlow 1.3 network device, the provisioning of the test environment, the UT-NICS test plan, and the successful results of the tests.
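An OpenFlow meter rate-limits a flow in much the same way as a token bucket. The toy model below is not switch code; it is a sketch (with made-up rates) of the policy behavior the project relied on: traffic within the reserved rate passes, and excess traffic is dropped until tokens replenish:

```python
# Toy token-bucket model of meter-based rate limiting.
# Rates and sizes are illustrative units, not real configuration.

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # burst allowance
        self.tokens = capacity

    def tick(self, seconds):
        """Replenish tokens for elapsed time, capped at capacity."""
        self.tokens = min(self.capacity, self.tokens + self.rate * seconds)

    def allow(self, size):
        """Admit a packet if enough tokens remain, else drop it."""
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

bucket = TokenBucket(rate=100, capacity=100)
print(bucket.allow(80))   # True: within the reservation
print(bucket.allow(80))   # False: burst exceeds remaining tokens
bucket.tick(1)            # one second of replenishment
print(bucket.allow(80))   # True: tokens restored
```

Per-port queueing complements this by deciding what happens to the conforming traffic, which together enables the reservation-style quality of service described above.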
In recent years, big data analysis has been widely applied to many research fields including biology, physics, transportation, and material science. Even though the demands for big data migration and analysis are dramatically increasing in campus IT infrastructures, several technical challenges need to be addressed. First, frequent big data transmission between storage systems in different research groups imposes a heavy burden on the regular campus network. Second, the current campus IT infrastructure is not designed to fully utilize the hardware capacity for big data migration and analysis. Last but not least, running big data applications on top of large-scale high-performance computing facilities is not straightforward, especially for researchers and engineers in non-IT disciplines. We develop a campus IT infrastructure for big data migration and analysis, called BIC-LSU, which consists of a task-aware Clos OpenFlow network, high-performance cache storage servers, customized high-performance transfer applications, a light-weight control framework to manipulate existing big data storage systems and job scheduling systems, and a comprehensive social networking-enabled web portal. BIC-LSU achieves 40 Gb/s disk-to-disk big data transmission, maintains short average transmission task completion times, enables converged control of commonly deployed storage and job scheduling systems, and makes big data analysis easier through a universal user-friendly interface. The BIC-LSU software has minimal dependencies and high extensibility, so other research institutions can easily customize and deploy BIC-LSU as an augmented service on their existing IT infrastructure.
In this investigation, we study how application performance is affected when jobs are permitted to share compute nodes. A series of application kernels consisting of a diverse set of benchmark calculations were run in both exclusive and node-sharing modes on the Center for Computational Research’s high-performance computing (HPC) cluster. Very little increase in runtime was observed due to job contention among application kernel jobs run on shared nodes. The small differences in runtime were quantitatively modeled in order to characterize the resource contention and attempt to determine the circumstances under which it would or would not be important. A machine learning regression model applied to the runtime data successfully fitted the small differences between the exclusive and shared node runtime data; it also provided insight into the contention for node resources that occurs when jobs are allowed to share nodes. Analysis of a representative job mix shows that runtime of shared jobs is affected primarily by the memory subsystem, in particular by the reduction in the effective cache size due to sharing; this leads to higher utilization of DRAM. Insights such as these are crucial when formulating policies proposing node sharing as a mechanism for improving HPC utilization.
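The modeling step described above can be illustrated with a small regression sketch. The data here are synthetic and the feature is a stand-in contention proxy, not the paper's actual measurements; the point is only the technique of fitting runtime slowdown against a memory-pressure predictor:

```python
# Illustrative linear regression of shared-node slowdown against a
# memory-contention proxy.  Synthetic data, not the study's dataset.
import numpy as np

rng = np.random.default_rng(0)
cache_footprint = rng.uniform(0, 1, 50)   # co-runner cache pressure proxy
# ground truth: 5% max slowdown plus measurement noise
slowdown = 0.05 * cache_footprint + rng.normal(0, 0.005, 50)

# least-squares fit of slowdown = slope * footprint + intercept
A = np.vstack([cache_footprint, np.ones_like(cache_footprint)]).T
slope, intercept = np.linalg.lstsq(A, slowdown, rcond=None)[0]
print(f"fitted slope {slope:.3f}, intercept {intercept:.3f}")
```

In the study's setting, a small fitted slope over a representative job mix is what justifies node sharing as a utilization policy; the regression also identifies which resource (here, the cache proxy) drives the residual contention.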
XALT is a tracking tool that collects accurate, detailed, and continuous job-level and link-time data. XALT stores that data in a database and ensures that all data collection is transparent to users. XALT tracks the libraries and object files linked by an application. A recent feature improvement allows XALT to also track external subroutines and functions called by an application. This paper describes this function-tracking implementation in XALT and showcases the kind of data and analysis that become available from this new feature. A recently developed web-based interface to the XALT database is also described, which allows center staff to more easily understand software usage on their compute resources.
XALT is a job-monitoring tool that collects accurate, detailed, and continuous job-level and link-time data on all MPI jobs running on a computing cluster. Because of its usefulness and complementarity to existing logs and databases, XALT has been deployed on Stampede at the Texas Advanced Computing Center and on other high performance computing resources around the world. The data collected by XALT can be extremely valuable in helping resource providers understand resource usage and identify patterns and insights for future improvements. However, the volume of data collected by XALT grows quickly over time on large systems and presents challenges for access and analysis. In this paper, we describe the development of a prototype tool to analyze and visualize XALT data. The application uses Spark to process the large volume of log data and Shiny to visualize the results over the web. It provides an easy-to-use interface for users to conveniently share and communicate executable usage and patterns without prerequisite knowledge of big data technology.
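The core aggregation behind "executable usage" reports is simple to state. As a hedged sketch, shown in plain Python rather than Spark, with made-up records standing in for real XALT log entries:

```python
# Count job launches per executable from XALT-style records.
# The records below are invented examples, not real XALT output;
# in the prototype this aggregation runs in Spark over the full log.
from collections import Counter

records = [
    {"user": "alice", "exec": "vasp"},
    {"user": "bob",   "exec": "namd"},
    {"user": "alice", "exec": "vasp"},
]

usage = Counter(r["exec"] for r in records)
print(usage.most_common(1))  # [('vasp', 2)]
```

In Spark the same computation is a map to `(exec, 1)` pairs followed by a reduce-by-key, which is what lets it scale to the full log volume before Shiny renders the result.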