Wednesday, July 20 • 11:30am - 12:00pm
SW: Extended Batch Sessions and Three-Phase Debugging: Using DMTCP to Enhance the Batch Environment

Batch environments are notoriously unfriendly because it is hard to diagnose the health of a job interactively. A job may be terminated without warning when it reaches the end of its allotted runtime slot, or it may terminate even sooner because of an unsuspected bug that appears only at large scale.
Two strategies are proposed that take advantage of DMTCP for system-level checkpointing. First, we describe how to easily implement extended batch sessions that overcome the typical 24-hour limit on a single batch job on large HPC resources. This removes the need for the application-specific checkpointing found in many long-running codes. Second, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments.
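
As a rough illustration of the first strategy, the sketch below chooses the command for one segment of an extended batch session: restart from the previous checkpoint if one exists, otherwise launch fresh under DMTCP. It is not taken from the talk. The helper name `segment_command`, the one-hour interval, and the assumption that `dmtcp_restart_script.sh` sits in the checkpoint directory are all illustrative; on a real system the script is written wherever the DMTCP coordinator places it.

```python
import os


def segment_command(app_argv, ckpt_dir, interval_s=3600):
    """Return the argv for one batch segment (hypothetical helper).

    Assumption: a previous segment left dmtcp_restart_script.sh in
    ckpt_dir. If it is there, resume from it; otherwise start the
    application under dmtcp_launch with periodic checkpoints.
    """
    restart_script = os.path.join(ckpt_dir, "dmtcp_restart_script.sh")
    if os.path.exists(restart_script):
        # Later segment: resume from the most recent checkpoint.
        return [restart_script]
    # First segment: launch under DMTCP, checkpointing every interval_s seconds.
    return [
        "dmtcp_launch",
        "--ckptdir", ckpt_dir,
        "--interval", str(interval_s),
        *app_argv,
    ]
```

Each batch job submitted to the scheduler would run the returned command, so a chain of jobs that each fit the 24-hour limit behaves like one long session.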

