Jobmon

Jobmon is a Scientific Workflow Management system developed at IHME specifically for the institute’s needs. It is developed and maintained by IHME’s Scientific Computing team. Jobmon aims to reduce human pain by providing:

  • An easy to use Python API and R API.

  • Centralized monitoring of jobs, including the jobs’ statuses and errors.

  • A central SQL database with all information on past, current, and future runs.

  • Automatic retries to protect against random cluster failures.

  • Automatic retries following a resource failure, e.g. re-running a job with increased memory.

  • Whole-of-workflow resumes to handle missing data or in-flight code fixes.

  • Adds Application structure to what otherwise would be a soup of jobs.

  • Fine-grained job dependencies, including for jobs within “job arrays.”

  • An easy-to-use GUI.

Jobmon was originally developed to augment the Univa Grid Engine (UGE) and subsequently ported to Slurm when IHME switch from UGE to Slurm.

Table of Contents

Indices and tables