******** Overview ******** What is Jobmon? =============== Jobmon is a Scientific Workflow Management system developed for managing complex computational workflows on distributed computing systems. It provides: - **An easy to use Python API and R API** for defining workflows - **Centralized monitoring** of jobs, including statuses and errors - **A central SQL database** with all information on past, current, and future runs - **Automatic retries** to protect against random cluster failures - **Resource-aware retries** that automatically increase memory or runtime after failures - **Whole-of-workflow resumes** to handle missing data or in-flight code fixes - **Application structure** to organize what would otherwise be a soup of jobs - **Fine-grained job dependencies**, including for jobs within "job arrays" - **An easy-to-use GUI** for monitoring and debugging Key Concepts ============ Before diving in, it helps to understand a few key concepts: Workflow -------- A **Workflow** is a collection of Tasks and their dependencies. Think of it as the complete plan for a computational pipeline. For example, a workflow might process data for multiple locations, aggregate results, and generate reports. Task ---- A **Task** is a single executable command in your workflow. Each task runs independently and can depend on other tasks completing first. TaskTemplate ------------ A **TaskTemplate** is a pattern for creating similar tasks. Instead of defining each task individually, you define a template and then create tasks by filling in the variable parts (like location IDs or dates). Distributor ----------- A **Distributor** is where tasks actually run. Jobmon supports multiple distributors: - **Slurm**: For HPC clusters running Slurm - **Multiprocess**: For running tasks locally using multiple CPU cores - **Sequential**: For running tasks one at a time (useful for debugging) How It Works ============ .. code-block:: text ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Your Python/R │────▶│ Jobmon Server │────▶│ Distributor │ │ Script │ │ (Database) │ │ (Slurm, etc.) │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │ │ │ Define workflow │ Track state │ Run tasks │ Add tasks │ Handle retries │ Report status │ Set dependencies │ Store results │ ▼ ▼ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Jobmon GUI │ │ Monitor progress, debug failures │ └─────────────────────────────────────────────────────────────────┘ 1. **Define**: You write a Python or R script that defines your workflow 2. **Submit**: Jobmon validates your workflow and stores it in the database 3. **Execute**: The distributor runs your tasks on the cluster 4. **Monitor**: Track progress via CLI, GUI, or programmatically 5. **Resume**: If something fails, fix it and resume from where you left off When to Use Jobmon ================== Jobmon is ideal when you need to: - Run the same analysis across many parameter combinations (locations, years, etc.) - Manage complex dependencies between computational steps - Automatically handle transient failures (network issues, bad nodes) - Track resource usage and optimize future runs - Resume failed workflows without re-running completed work - Monitor long-running pipelines Jobmon may be overkill if you're just running a single script or a few independent jobs. Next Steps ========== - :doc:`installation` - Get Jobmon installed - :doc:`quickstart` - Create your first workflow - :doc:`/user_guide/core_concepts` - Deep dive into Jobmon concepts