Troubleshooting

This guide helps you diagnose and resolve common issues with Jobmon workflows.

Quick Diagnostics

When something goes wrong, start here:

Check workflow status in the GUI or CLI
Look at task errors in the Task Details page
Review log files (stdout/stderr paths)
Check that the Jobmon server is accessible

Common Errors

DistributorNotAlive Error

Symptom: DistributorNotAlive exception when running a workflow.

Cause: This usually occurs when the workflow is launched from a login node instead of a submit node.

Solution:

# Start an interactive session first
srun --pty bash

# Then run your workflow
python my_workflow.py

NO_DISTRIBUTOR_ID Error

Symptom: TaskInstance shows NO_DISTRIBUTOR_ID status.

Cause: Jobmon couldn’t submit the job to the cluster. Common causes:

Insufficient permissions for the partition/queue
Resource requests exceed queue limits
Invalid project code

Solution:

Check the error details in the GUI (Task Details → TaskInstances → Standard Error)
Verify your queue/partition access
Ensure resource requests are within limits
Check that your project code is valid
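
One way to check a request before submitting is to compare it against the limits of the queue you are targeting. The sketch below is a minimal illustration in plain Python. The `QUEUE_LIMITS` values and the helper names `memory_to_gb` and `find_violations` are hypothetical and not part of Jobmon; replace the limits with the ones published for your own cluster.

```python
import re

# Hypothetical limits for illustration only; look up the real values for your cluster.
QUEUE_LIMITS = {"memory_gb": 750.0, "runtime_hours": 72.0, "cores": 64}

_UNITS_GB = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}


def memory_to_gb(text: str) -> float:
    """Convert a string such as '20G' or '512M' to gigabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([MGT])B?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized memory value: {text!r}")
    return float(match.group(1)) * _UNITS_GB[match.group(2).upper()]


def find_violations(memory: str, runtime_hours: float, cores: int, limits=QUEUE_LIMITS):
    """Return a list of human-readable problems; an empty list means the request fits."""
    problems = []
    if memory_to_gb(memory) > limits["memory_gb"]:
        problems.append(f"memory {memory} exceeds {limits['memory_gb']} GB")
    if runtime_hours > limits["runtime_hours"]:
        problems.append(f"runtime {runtime_hours}h exceeds {limits['runtime_hours']}h")
    if cores > limits["cores"]:
        problems.append(f"cores {cores} exceeds {limits['cores']}")
    return problems


if __name__ == "__main__":
    print(find_violations("20G", 4, 2))    # request within the example limits
    print(find_violations("2T", 100, 2))   # memory and runtime both too large
```

An empty list suggests the request itself is not the problem, which points back to permissions or the project code.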

Connection Refused

Symptom: ConnectionRefusedError or timeout when running a workflow.

Cause: Can’t connect to the Jobmon server.

Solution:

Verify network connectivity (VPN if required)
Check the server URL in your configuration:
```
cat ~/.jobmon.yaml
```

Test the server directly:

curl http://your-jobmon-server:5000/health
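
The same check can be run from Python, which is handy when you want it at the top of a workflow script. This is a minimal sketch using only the standard library. The helper name `check_server` is hypothetical, the URL is the placeholder from the curl example, and treating HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import urllib.error
import urllib.request


def check_server(url: str, timeout: float = 5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a GET against url; network errors are reported, not raised."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 404 or 500).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, or timeout: the server was not reached.
        return False, f"unreachable: {exc}"


if __name__ == "__main__":
    # Placeholder URL; substitute your server's address.
    ok, detail = check_server("http://your-jobmon-server:5000/health")
    print("server reachable" if ok else f"problem: {detail}")
```

An "unreachable" result points to a network or VPN problem, while an HTTP error status means the server was reached and the configured URL or the server itself needs attention.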

Workflow Already Exists

Symptom: An error reporting that the workflow already exists when you try to run it.

Cause: You are creating a workflow with the same workflow_args as an existing one.

Solution:

To resume the existing workflow: workflow.run(resume=True)
To create a new workflow: Use different workflow_args
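
If you want a fresh workflow on every run, one common approach is to append a timestamp to workflow_args. The sketch below assumes nothing about Jobmon beyond workflow_args being a string; `make_workflow_args` and the `my_analysis` prefix are hypothetical names for illustration.

```python
from datetime import datetime
from typing import Optional


def make_workflow_args(prefix: str, now: Optional[datetime] = None) -> str:
    """Build a workflow_args string that differs on every run (to the second)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}"


if __name__ == "__main__":
    # "my_analysis" is a hypothetical prefix; pass the result as workflow_args
    # when creating the workflow.
    print(make_workflow_args("my_analysis"))
```

Record the generated value somewhere if you may want to resume that run later, since a resume presumably needs the same workflow_args as the original.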

Resource Errors

Out of Memory (OOM)

Symptom: Task fails with RESOURCE_ERROR status and a memory-related error in the logs.

Solution:

Check actual memory usage in the GUI

Increase memory request:

compute_resources={"memory": "20G", ...}

Enable automatic resource scaling:

task = template.create_task(
    max_attempts=3,  # Will retry with more resources
    ...
)
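
To make the retry behaviour concrete, the sketch below shows how a memory request could grow across attempts. The 50% growth per retry is an assumption chosen for illustration, not a confirmed Jobmon default, and `scaled_requests` is a hypothetical helper; check the documentation for your Jobmon version for the actual scaling rule.

```python
def scaled_requests(initial_gb: float, max_attempts: int, scale: float = 0.5):
    """Return the memory request (in GB) for each attempt, growing by `scale` per retry."""
    requests = []
    current = initial_gb
    for _ in range(max_attempts):
        requests.append(current)
        # Assumed rule: each retry asks for (1 + scale) times the previous request.
        current = current * (1 + scale)
    return requests


if __name__ == "__main__":
    print(scaled_requests(20, 3))  # [20, 30.0, 45.0]
```

Under this assumption, a task starting at 20G with max_attempts=3 never requests more than 45G, so a task that truly needs more would still fail and needs a larger initial request.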

Timeout / Runtime Exceeded

Symptom: Task killed for exceeding runtime.

Solution:

Increase runtime:

compute_resources={"runtime": "4h", ...}

Use a queue with longer limits

Or set a fallback queue:

task = template.create_task(
    fallback_queues=["long.q"],
    ...
)

Debugging Workflows

Using the GUI

The Jobmon GUI is the fastest way to investigate issues:

Find your workflow by name or ID
Click to see task breakdown by status
Click a failed task to see:
- Error messages
- Resource usage
- stdout/stderr file paths
- Retry history

Using the CLI

# Check workflow status
jobmon workflow_status -w <workflow_id>

# List the workflow's tasks in FATAL status
jobmon workflow_tasks -w <workflow_id> -s FATAL

# Check specific task
jobmon task_status -t <task_id>

# See task dependencies
jobmon task_dependencies -t <task_id>

Reading Log Files

Find log file paths:

jobmon get_filepaths -w <workflow_id>

Or check the Task Details page in the GUI.
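
Once you have a path, the end of the stderr file is usually the most informative part. The following is a minimal sketch for printing the last lines of a log from Python; `tail` is a hypothetical helper, and the path you pass is whatever the CLI or GUI reported.

```python
from collections import deque
from pathlib import Path


def tail(path, n: int = 20):
    """Return the last n lines of a text file, keeping only n lines in memory at a time."""
    with Path(path).open("r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    import sys

    # Usage: python tail_log.py /path/to/task.stderr
    if len(sys.argv) > 1:
        print("\n".join(tail(sys.argv[1])))
```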

Common Patterns

Tasks Stuck in PENDING

Possible causes:

Upstream tasks haven’t completed
Cluster is busy (check queue)
Concurrency limit reached

Check:

jobmon task_dependencies -t <task_id>

Tasks Fail Immediately

Possible causes:

Command not found (check PATH)
Missing dependencies (conda environment)
File not found errors

Debug: Run the command manually to see the actual error.
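
If you prefer to script the manual run, the sketch below resolves the executable on PATH and then runs the command once, capturing its exit code and stderr. `diagnose_command` and the example command are hypothetical; run it in the same environment (for example the same conda environment) that the task uses, otherwise the result may not match what happens on the cluster.

```python
import shlex
import shutil
import subprocess


def diagnose_command(command: str, timeout: float = 60.0) -> dict:
    """Run a task's command once and report whether it was found and how it exited."""
    argv = shlex.split(command)
    if shutil.which(argv[0]) is None:
        # Matches the "command not found (check PATH)" cause.
        return {"found": False, "returncode": None, "stderr": f"{argv[0]}: not found on PATH"}
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {"found": True, "returncode": result.returncode, "stderr": result.stderr.strip()}


if __name__ == "__main__":
    # Hypothetical command; substitute the exact command string of the failing task.
    print(diagnose_command("python my_script.py --input data.csv"))
```

A non-zero returncode together with the captured stderr usually shows whether the failure is a missing file, a missing package, or an error in the script itself.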

Workflow Hangs

Possible causes:

Network issues reaching the Jobmon server
All remaining tasks are waiting on a failed upstream task
Workflow timeout reached

Check: Look at the workflow status in the GUI.
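
The "waiting on failed upstream" case is easiest to see on a small dependency graph. The sketch below is a plain-Python illustration, not Jobmon code: the task names, the status strings other than FATAL and PENDING, and the helper `blocked_by_failure` are all hypothetical. It walks downstream from every failed task to find the tasks that can never start.

```python
from collections import deque


def blocked_by_failure(upstreams: dict, statuses: dict, failed_status: str = "FATAL") -> set:
    """Return tasks that can never start because some ancestor has failed."""
    # Invert the upstream map so we can walk from each failed task to its descendants.
    downstreams = {task: [] for task in statuses}
    for task, parents in upstreams.items():
        for parent in parents:
            downstreams.setdefault(parent, []).append(task)

    blocked = set()
    queue = deque(task for task, status in statuses.items() if status == failed_status)
    while queue:
        for child in downstreams.get(queue.popleft(), []):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked


if __name__ == "__main__":
    # Hypothetical four-task workflow: prep -> model -> (plot, report)
    upstreams = {"model": ["prep"], "plot": ["model"], "report": ["model"]}
    statuses = {"prep": "DONE", "model": "FATAL", "plot": "PENDING", "report": "PENDING"}
    print(sorted(blocked_by_failure(upstreams, statuses)))  # ['plot', 'report']
```

If every unfinished task appears in the blocked set, the workflow cannot make progress until the failed task is fixed and the workflow is resumed.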

Getting More Help

If you’re still stuck:

Check the full error message and stack trace
Search existing issues: https://github.com/ihmeuw-scicomp/jobmon/issues
Ask for help with:
- Workflow ID
- Error message
- What you were trying to do
- Relevant code snippets
