Overview

What is Jobmon?

Jobmon is a Scientific Workflow Management system developed for managing complex computational workflows on distributed computing systems. It provides:

  • An easy to use Python API and R API for defining workflows

  • Centralized monitoring of jobs, including statuses and errors

  • A central SQL database with all information on past, current, and future runs

  • Automatic retries to protect against random cluster failures

  • Resource-aware retries that automatically increase memory or runtime after failures

  • Whole-of-workflow resumes to handle missing data or in-flight code fixes

  • Application structure to organize what would otherwise be a soup of jobs

  • Fine-grained job dependencies, including for jobs within “job arrays”

  • An easy-to-use GUI for monitoring and debugging

Key Concepts

Before diving in, it helps to understand a few key concepts:

Workflow

A Workflow is a collection of Tasks and their dependencies. Think of it as the complete plan for a computational pipeline. For example, a workflow might process data for multiple locations, aggregate results, and generate reports.

Task

A Task is a single executable command in your workflow. Each task runs independently and can depend on other tasks completing first.

TaskTemplate

A TaskTemplate is a pattern for creating similar tasks. Instead of defining each task individually, you define a template and then create tasks by filling in the variable parts (like location IDs or dates).

Distributor

A Distributor is where tasks actually run. Jobmon supports multiple distributors:

  • Slurm: For HPC clusters running Slurm

  • Multiprocess: For running tasks locally using multiple CPU cores

  • Sequential: For running tasks one at a time (useful for debugging)

How It Works

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Your Python/R  │────▶│  Jobmon Server  │────▶│   Distributor   │
│     Script      │     │   (Database)    │     │  (Slurm, etc.)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        │  Define workflow      │  Track state          │  Run tasks
        │  Add tasks            │  Handle retries       │  Report status
        │  Set dependencies     │  Store results        │
        ▼                       ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Jobmon GUI                              │
│              Monitor progress, debug failures                   │
└─────────────────────────────────────────────────────────────────┘
  1. Define: You write a Python or R script that defines your workflow

  2. Submit: Jobmon validates your workflow and stores it in the database

  3. Execute: The distributor runs your tasks on the cluster

  4. Monitor: Track progress via CLI, GUI, or programmatically

  5. Resume: If something fails, fix it and resume from where you left off

When to Use Jobmon

Jobmon is ideal when you need to:

  • Run the same analysis across many parameter combinations (locations, years, etc.)

  • Manage complex dependencies between computational steps

  • Automatically handle transient failures (network issues, bad nodes)

  • Track resource usage and optimize future runs

  • Resume failed workflows without re-running completed work

  • Monitor long-running pipelines

Jobmon may be overkill if you’re just running a single script or a few independent jobs.

Next Steps