Architecture and Design
This section documents Jobmon’s system architecture, design decisions, and technical implementation details.
Overview
Jobmon is a distributed workflow management system with several key components:
Client Libraries (Python, R) - Define and submit workflows
REST API Server - Manages state and coordinates execution
Database - Persists workflow state and history
Distributors - Execute tasks on various backends (Slurm, local, etc.)
GUI - Web interface for monitoring and debugging
Architecture Diagram
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Python/R │────▶│ Jobmon Server │────▶│ Distributor │
│ Client │◀────│ (REST API) │◀────│ (Slurm) │
└─────────────────┘ └────────┬────────┘ └─────────────────┘
│
┌────────▼────────┐
│ Database │
│ (MySQL/etc) │
└────────┬────────┘
│
┌────────▼────────┐
│ Jobmon GUI │
│ (React) │
└─────────────────┘
Key Design Principles
- Fault Tolerance
Workflows survive client disconnections, server restarts, and cluster failures.
- Resumability
Any workflow can be resumed from its last known state.
- Scalability
Designed to handle tens of thousands of concurrent tasks.
- Observability
Comprehensive logging, telemetry, and status tracking.
- Cluster Agnostic
Pluggable distributor architecture supports multiple backends.
Documentation Sections
Design Documents
For detailed design decisions and implementation notes, see the design/
directory in the repository:
WORKFLOW_RUN_REFACTORING.md- Workflow execution architectureSTRUCTURED_LOGGING_IMPLEMENTATION_V2.md- Logging system designtask_status_audit_design.md- Task status management
See Also
Developer Guide - Contributing to Jobmon
Logical View (aka software layers, Component View) - Component architecture details
Jobmon’s Finite State Machines - State machine documentation