XSEDE16 has ended
Wednesday, July 20 • 11:30am - 12:00pm
SW: Extended Batch Sessions and Three-Phase Debugging: Using DMTCP to Enhance the Batch Environment

Batch environments are notoriously unfriendly because it's not easy to interactively diagnose the health of a job. A job may be terminated without warning when it reaches the end of an allotted runtime slot, or it may terminate even sooner due to an unsuspected bug that occurs only at large scale.
Two strategies are proposed that take advantage of DMTCP for system-level checkpointing. First, we describe how to easily implement extended batch sessions that overcome the typical limit of 24 hours for a single batch job on large HPC resources. This removes the need for the application-specific checkpointing found in many long-running codes. Second, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments.
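
As a rough illustration of the extended-batch-session idea (not taken from the talk itself), the sketch below splits a long run into consecutive batch segments that each fit under the 24-hour limit. The function name `plan_segments`, the half-hour checkpoint margin, and the "launch"/"restart" labels are assumptions made for this example. In practice the first segment would start the application under DMTCP (for instance with `dmtcp_launch`) and each later segment would resume from the most recent checkpoint (for instance with `dmtcp_restart`); how the talk wires this into a particular scheduler is not stated in the abstract.

```python
import math


def plan_segments(total_hours, limit_hours=24.0, margin_hours=0.5):
    """Split a long run into batch segments that each fit under the limit.

    Hypothetical helper: each segment reserves margin_hours at its end for
    writing a checkpoint, so only (limit_hours - margin_hours) of useful
    work fits into one segment.
    """
    usable = limit_hours - margin_hours
    if usable <= 0:
        raise ValueError("margin must be smaller than the batch limit")

    count = math.ceil(total_hours / usable)
    segments = []
    done = 0.0
    for index in range(count):
        work = min(usable, total_hours - done)
        segments.append({
            "segment": index,
            # The first segment starts the job; later ones resume a checkpoint.
            "action": "launch" if index == 0 else "restart",
            "work_hours": work,
        })
        done += work
    return segments


if __name__ == "__main__":
    # A 60-hour computation under a 24-hour batch limit.
    for seg in plan_segments(60.0):
        print(seg)
```

Under these assumptions, a 60-hour computation needs three chained batch jobs: one launch followed by two restarts.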


Wednesday July 20, 2016 11:30am - 12:00pm EDT
Brickell