Troubleshooting
This guide helps you diagnose and resolve common issues with Jobmon workflows.
Quick Diagnostics
When something goes wrong, start here:
Check workflow status in the GUI or CLI
Look at task errors in the Task Details page
Review log files (stdout/stderr paths)
Check that the Jobmon server is accessible
Common Errors
DistributorNotAlive Error
Symptom: DistributorNotAlive exception when running a workflow.
Cause: Usually occurs when running from a login node instead of a submit node.
Solution:
# Start an interactive session first
srun --pty bash
# Then run your workflow
python my_workflow.py
NO_DISTRIBUTOR_ID Error
Symptom: TaskInstance shows NO_DISTRIBUTOR_ID status.
Cause: Jobmon couldn’t submit the job to the cluster. Common causes:
Insufficient permissions for the partition/queue
Resource requests exceed queue limits
Invalid project code
Solution:
Check the error details in the GUI (Task Details → TaskInstances → Standard Error)
Verify your queue/partition access
Ensure resource requests are within limits
Check that your project code is valid
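Resource requests that exceed queue limits can be caught before submission with a small pre-flight check. The sketch below is illustrative only: QUEUE_LIMITS, parse_memory_gb, parse_runtime_hours and check_request are hypothetical names, and the limits are made-up numbers rather than values from Jobmon or any real cluster. It assumes memory strings like "20G" and runtime strings like "4h", the formats used elsewhere in this guide.

```python
import re

# Hypothetical limits for illustration; look up the real values for your queue.
QUEUE_LIMITS = {"memory_gb": 750.0, "runtime_hours": 72.0}

_MEMORY_UNITS = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}
_RUNTIME_UNITS = {"s": 1 / 3600, "m": 1 / 60, "h": 1.0}


def _parse(value, units):
    """Split a string such as '20G' into a number and a unit, then convert."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z])\s*", value)
    if match is None or match.group(2) not in units:
        raise ValueError(f"Unrecognized value: {value!r}")
    return float(match.group(1)) * units[match.group(2)]


def parse_memory_gb(value):
    """Convert a memory string (M, G or T suffix) to gigabytes."""
    return _parse(value.upper(), _MEMORY_UNITS)


def parse_runtime_hours(value):
    """Convert a runtime string (s, m or h suffix) to hours."""
    return _parse(value.lower(), _RUNTIME_UNITS)


def check_request(compute_resources, limits=QUEUE_LIMITS):
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if "memory" in compute_resources:
        mem = parse_memory_gb(compute_resources["memory"])
        if mem > limits["memory_gb"]:
            problems.append(f"memory {mem:g}G exceeds limit {limits['memory_gb']:g}G")
    if "runtime" in compute_resources:
        hours = parse_runtime_hours(compute_resources["runtime"])
        if hours > limits["runtime_hours"]:
            problems.append(f"runtime {hours:g}h exceeds limit {limits['runtime_hours']:g}h")
    return problems
```

Calling check_request({"memory": "20G", "runtime": "4h"}) returns an empty list under these example limits. A non-empty list names each value that is over the limit.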
Connection Refused
Symptom: ConnectionRefusedError or timeout when running a workflow.
Cause: Can’t connect to the Jobmon server.
Solution:
Verify network connectivity (VPN if required)
Check server URL in your configuration:
cat ~/.jobmon.yaml
Test the server directly:
curl http://your-jobmon-server:5000/health
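The same check can be made from Python, which is convenient inside a workflow script. This is a minimal sketch using only the standard library. check_server is a hypothetical helper, not part of Jobmon, and the /health path is simply the one from the curl example above.

```python
import urllib.error
import urllib.request


def check_server(url, timeout=5.0):
    """Return (ok, detail) for a GET request to the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, or timeout.
        return False, f"unreachable: {exc}"
```

For example, ok, detail = check_server("http://your-jobmon-server:5000/health") separates "the server replied with an error" from "the server could not be reached at all". The second case usually points at the network or the configured URL.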
Workflow Already Exists
Symptom: Error about workflow already existing when trying to run.
Cause: Trying to create a workflow with the same workflow_args as an existing one.
Solution:
To resume the existing workflow:
workflow.run(resume=True)
To create a new workflow: Use different workflow_args
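If every run should be a fresh workflow, one common approach is to add a timestamp to workflow_args. The helper below is a hypothetical sketch, not a Jobmon function. How the resulting string is passed to Jobmon depends on your own workflow code and is not shown.

```python
from datetime import datetime, timezone


def make_workflow_args(prefix, now=None):
    """Build a workflow_args string that differs for each run (to the second)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%S')}"
```

Record the generated value somewhere. Resuming relies on reusing the same workflow_args, so a value you cannot reproduce cannot be resumed.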
Resource Errors
Out of Memory (OOM)
Symptom: Task fails with RESOURCE_ERROR status and a memory-related error in the logs.
Solution:
Check actual memory usage in the GUI
Increase memory request:
compute_resources={"memory": "20G", ...}
Enable automatic resource scaling:
task = template.create_task(
    max_attempts=3,  # Will retry with more resources
    ...
)
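To get a feel for what retrying with more resources could mean across three attempts, the sketch below lists a memory request per attempt. The 1.5x growth factor and the scaled_memory_requests helper are assumptions made for illustration. They are not Jobmon's documented scaling behavior, so check your Jobmon version for the actual rule.

```python
def scaled_memory_requests(initial_gb, max_attempts, factor=1.5):
    """Return the memory request (in GB) for each attempt, growing by `factor`."""
    if initial_gb <= 0 or max_attempts < 1 or factor < 1:
        raise ValueError("initial_gb > 0, max_attempts >= 1 and factor >= 1 are required")
    # Attempt 0 uses the initial request; each retry multiplies it by `factor`.
    return [round(initial_gb * factor**attempt, 2) for attempt in range(max_attempts)]
```

If the final value in the list is still below what the task actually used, raise the initial memory request rather than relying on retries.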
Timeout / Runtime Exceeded
Symptom: Task killed for exceeding runtime.
Solution:
Increase runtime:
compute_resources={"runtime": "4h", ...}
Use a queue with longer limits
Or set a fallback queue:
task = template.create_task(
    fallback_queues=["long.q"],
    ...
)
Debugging Workflows
Using the GUI
The Jobmon GUI is the fastest way to investigate issues:
Find your workflow by name or ID
Click to see task breakdown by status
Click a failed task to see:
- Error messages
- Resource usage
- stdout/stderr file paths
- Retry history
Using the CLI
# Check workflow status
jobmon workflow_status -w <workflow_id>
# See task details
jobmon workflow_tasks -w <workflow_id> -s FATAL
# Check specific task
jobmon task_status -t <task_id>
# See task dependencies
jobmon task_dependencies -t <task_id>
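If you check the same workflow repeatedly, the commands above can be wrapped in a short script. The sketch below uses only the subcommands and flags listed in this section. jobmon_command and run_jobmon are hypothetical helper names, and the code assumes the jobmon executable is on your PATH.

```python
import shutil
import subprocess

# Subcommands and their ID flags, exactly as listed in this section.
_COMMANDS = {
    "workflow_status": "-w",
    "workflow_tasks": "-w",
    "task_status": "-t",
    "task_dependencies": "-t",
}


def jobmon_command(subcommand, identifier, status=None):
    """Build the argument list for one of the CLI calls shown above."""
    if subcommand not in _COMMANDS:
        raise ValueError(f"Unsupported subcommand: {subcommand}")
    argv = ["jobmon", subcommand, _COMMANDS[subcommand], str(identifier)]
    if status is not None:
        if subcommand != "workflow_tasks":
            raise ValueError("a status filter is only shown for workflow_tasks")
        argv += ["-s", status]
    return argv


def run_jobmon(subcommand, identifier, status=None):
    """Run the command and return its stdout; raises if jobmon is missing or fails."""
    if shutil.which("jobmon") is None:
        raise FileNotFoundError("jobmon executable not found on PATH")
    result = subprocess.run(
        jobmon_command(subcommand, identifier, status),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the argument list separately from running it means the exact command can be printed or logged before anything is executed.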
Reading Log Files
Find log file paths:
jobmon get_filepaths -w <workflow_id>
Or check the Task Details page in the GUI.
Common Patterns
Tasks Stuck in PENDING
Possible causes:
Upstream tasks haven’t completed
Cluster is busy (check queue)
Concurrency limit reached
Check:
jobmon task_dependencies -t <task_id>
Tasks Fail Immediately
Possible causes:
Command not found (check PATH)
Missing dependencies (conda environment)
File not found errors
Debug: Run the command manually to see the actual error.
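The manual run can also be scripted, which helps when the task command is long. The sketch below uses only the standard library. It first checks that the executable is on PATH, then runs the command and reports the exit code and stderr. diagnose_command is a hypothetical helper. Run it in the same environment as the task, for example with the same conda environment activated.

```python
import shlex
import shutil
import subprocess


def diagnose_command(command, timeout=60):
    """Run a task command by hand and report why it might fail immediately."""
    # shlex.split handles quoting but not shell features such as pipes.
    argv = shlex.split(command)
    if not argv:
        return {"ok": False, "reason": "empty command"}
    if shutil.which(argv[0]) is None:
        return {"ok": False, "reason": f"{argv[0]!r} not found on PATH"}
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "ok": result.returncode == 0,
        "reason": f"exit code {result.returncode}",
        "stderr": result.stderr.strip(),
    }
```

A "not found on PATH" result points at the first two causes above. A non-zero exit code with a message in stderr usually points at a missing file or a bad argument.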
Workflow Hangs
Possible causes:
Network issues reaching the Jobmon server
All remaining tasks are waiting on a failed upstream task
Workflow timeout reached
Check: Look at the workflow status in the GUI.
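Tasks waiting on a failed upstream can be pictured as a graph problem: a task can never start if any task upstream of it has failed. The sketch below works on plain dictionaries, not Jobmon objects. blocked_tasks, the task names and every status string other than FATAL are hypothetical. In practice the dependency information comes from the GUI or from jobmon task_dependencies.

```python
def blocked_tasks(upstreams, statuses, failed_status="FATAL"):
    """Return the tasks that cannot run because some ancestor has failed.

    upstreams maps each task to the list of tasks it directly depends on.
    statuses maps each task to its current status string.
    """
    cache = {}

    def has_failed_ancestor(task, visiting=()):
        if task in cache:
            return cache[task]
        if task in visiting:
            raise ValueError(f"Dependency cycle detected at {task!r}")
        # A task is blocked if a direct parent failed or a parent is itself blocked.
        result = any(
            statuses.get(parent) == failed_status
            or has_failed_ancestor(parent, visiting + (task,))
            for parent in upstreams.get(task, [])
        )
        cache[task] = result
        return result

    return sorted(
        task
        for task in statuses
        if statuses[task] != failed_status and has_failed_ancestor(task)
    )
```

If every unfinished task appears in the result, the workflow cannot make progress until the failed tasks are fixed and the workflow is resumed.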
Getting More Help
If you’re still stuck:
Check the full error message and stack trace
Search existing issues: https://github.com/ihmeuw-scicomp/jobmon/issues
Ask for help with:
- Workflow ID
- Error message
- What you were trying to do
- Relevant code snippets
See Also
Monitoring - Monitoring workflows
CLI Reference - CLI command reference