Overview
What is Jobmon?
Jobmon is a scientific workflow management system for running complex computational workflows on distributed computing systems. It provides:
Easy-to-use Python and R APIs for defining workflows
Centralized monitoring of jobs, including statuses and errors
A central SQL database with all information on past, current, and future runs
Automatic retries to protect against random cluster failures
Resource-aware retries that automatically increase memory or runtime after failures
Whole-of-workflow resumes to handle missing data or in-flight code fixes
Application structure to organize what would otherwise be a soup of jobs
Fine-grained job dependencies, including for jobs within “job arrays”
An easy-to-use GUI for monitoring and debugging
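To make the idea of resource-aware retries concrete, here is a minimal sketch in plain Python. It does not use the Jobmon API: the names `Resources`, `OutOfMemory`, `run_with_scaling_retries`, and `hungry_job` are all hypothetical, and the scaling factor of 1.5 is an assumption chosen for illustration. It scales memory only, although the same pattern would apply to runtime.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource request for one job."""
    memory_gb: float
    runtime_s: int


class OutOfMemory(Exception):
    """Stand-in for a scheduler killing a job that exceeded its memory."""


def run_with_scaling_retries(job, resources, max_attempts=3, scale=1.5):
    """Call job(resources); after a memory failure, request more and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(resources), resources, attempt
        except OutOfMemory:
            if attempt == max_attempts:
                raise  # out of attempts, surface the failure
            # Ask for more memory on the next attempt.
            resources = replace(resources, memory_gb=resources.memory_gb * scale)


def hungry_job(resources):
    # Hypothetical job that needs at least 4 GB to finish.
    if resources.memory_gb < 4:
        raise OutOfMemory(f"needed 4 GB, had {resources.memory_gb} GB")
    return "done"


result, final_resources, attempts = run_with_scaling_retries(
    hungry_job, Resources(memory_gb=2, runtime_s=600)
)
print(result, final_resources.memory_gb, attempts)
```

The job fails at 2 GB and 3 GB, then succeeds on the third attempt at 4.5 GB. Jobmon performs this kind of adjustment for you; the sketch only shows the shape of the logic.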
Key Concepts
Before diving in, it helps to understand a few key concepts:
Workflow
A Workflow is a collection of Tasks and their dependencies. Think of it as the complete plan for a computational pipeline. For example, a workflow might process data for multiple locations, aggregate results, and generate reports.
Task
A Task is a single executable command in your workflow. Each task runs independently and can depend on other tasks completing first.
TaskTemplate
A TaskTemplate is a pattern for creating similar tasks. Instead of defining each task individually, you define a template and then create tasks by filling in the variable parts (like location IDs or dates).
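The template idea can be sketched with ordinary string formatting. This is an illustration of the concept only, not the Jobmon API: the script name `model.py`, the variable `command_template`, and the `location_id` and `year` placeholders are assumptions made for the example.

```python
from itertools import product

# One pattern with placeholders for the parts that vary between tasks.
command_template = "python model.py --location_id {location_id} --year {year}"

# Hypothetical values to fill in.
location_ids = [101, 102]
years = [2020, 2021]

# One command per (location, year) combination.
commands = [
    command_template.format(location_id=loc, year=yr)
    for loc, yr in product(location_ids, years)
]

for command in commands:
    print(command)
```

Two locations and two years yield four commands from a single template, which is the saving a TaskTemplate is meant to provide at larger scale.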
Distributor
A Distributor is the backend on which tasks actually run. Jobmon supports multiple distributors:
Slurm: For HPC clusters running Slurm
Multiprocess: For running tasks locally using multiple CPU cores
Sequential: For running tasks one at a time (useful for debugging)
How It Works
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Your Python/R   │────▶│  Jobmon Server  │────▶│   Distributor   │
│     Script      │     │   (Database)    │     │  (Slurm, etc.)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         │ Define workflow       │ Track state           │ Run tasks
         │ Add tasks             │ Handle retries        │ Report status
         │ Set dependencies      │ Store results         │
         ▼                       ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Jobmon GUI                            │
│                Monitor progress, debug failures                 │
└─────────────────────────────────────────────────────────────────┘
Define: You write a Python or R script that defines your workflow
Submit: Jobmon validates your workflow and stores it in the database
Execute: The distributor runs your tasks, for example on a Slurm cluster
Monitor: Track progress via the CLI, the GUI, or programmatically
Resume: If something fails, fix it and resume from where you left off
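The define, execute, and resume steps above can be sketched as a small dependency-ordered runner. This is a simplified, single-process stand-in and not how Jobmon is implemented: the task names, the `run_workflow` and `run_task` helpers, and the in-memory `done` set (playing the role of the database) are all hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) and stops at the first failure.

```python
from graphlib import TopologicalSorter

# Define: map each task name to the set of tasks it depends on.
dependencies = {
    "process_a": set(),
    "process_b": set(),
    "aggregate": {"process_a", "process_b"},
    "report": {"aggregate"},
}


def run_workflow(dependencies, run_task, done=None):
    """Run tasks in dependency order, skipping any already in `done`.

    Stops at the first failure and returns (done, failed_task_or_None).
    """
    done = set(done or ())
    for name in TopologicalSorter(dependencies).static_order():
        if name in done:
            continue  # completed in an earlier run, do not repeat it
        try:
            run_task(name)
        except Exception:
            return done, name
        done.add(name)
    return done, None


executed = []            # records every attempt, for illustration
broken = {"aggregate"}   # pretend this task has a bug on the first run


def run_task(name):
    executed.append(name)
    if name in broken:
        raise RuntimeError(f"{name} failed")


# Execute: the first run stops when "aggregate" fails.
done, failed = run_workflow(dependencies, run_task)
first_failure = failed

# Resume: after the fix, only unfinished tasks run again.
broken.clear()
done, failed = run_workflow(dependencies, run_task, done=done)
```

On the second call the two processing tasks are skipped because they already completed, so only `aggregate` and `report` run. Jobmon keeps the equivalent state in its database rather than in a Python set.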
When to Use Jobmon
Jobmon is ideal when you need to:
Run the same analysis across many parameter combinations (locations, years, etc.)
Manage complex dependencies between computational steps
Automatically handle transient failures (network issues, bad nodes)
Track resource usage and optimize future runs
Resume failed workflows without re-running completed work
Monitor long-running pipelines
Jobmon may be overkill if you’re just running a single script or a few independent jobs.
Next Steps
Installation - Get Jobmon installed
Quickstart - Create your first workflow
Core Concepts - Deep dive into Jobmon concepts