IHME Clusters

This page documents the compute clusters available at IHME for running Jobmon workflows.

Slurm Cluster

IHME’s primary compute cluster runs Slurm. This is the default cluster for all Jobmon workflows.

Cluster Name

To run on this cluster, set it as the default cluster name on your tool:

from jobmon.client.tool import Tool

tool = Tool(name="my_tool")
tool.set_default_cluster_name("slurm")

Available Queues

Queue	Max Runtime	Max Memory	Use Case
all.q	3 days	750GB	General purpose jobs
long.q	16 days	750GB	Long-running jobs
d.q	24 hours	1TB	High-memory jobs

Note

Queue limits may change. Check with Scientific Computing for current values.
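The table above can be encoded as a small lookup to sanity-check a request before submitting. This is only a sketch: the helper name queues_that_fit and the dictionary layout are hypothetical, the limits are copied from the table (so they may be stale), and 1TB is assumed to mean 1000 GB.

```python
from typing import Dict, List

# Limits copied from the queue table on this page; they may be out of date.
# Assumption: 1TB is treated as 1000 GB.
QUEUE_LIMITS: Dict[str, Dict[str, float]] = {
    "all.q": {"max_runtime_hours": 3 * 24, "max_memory_gb": 750},
    "long.q": {"max_runtime_hours": 16 * 24, "max_memory_gb": 750},
    "d.q": {"max_runtime_hours": 24, "max_memory_gb": 1000},
}


def queues_that_fit(runtime_hours: float, memory_gb: float) -> List[str]:
    """Return the queues whose documented limits cover the request."""
    return [
        name
        for name, limits in QUEUE_LIMITS.items()
        if runtime_hours <= limits["max_runtime_hours"]
        and memory_gb <= limits["max_memory_gb"]
    ]


print(queues_that_fit(runtime_hours=48, memory_gb=100))  # ['all.q', 'long.q']
print(queues_that_fit(runtime_hours=12, memory_gb=900))  # ['d.q']
```

An empty result means no documented queue covers the request, which is a hint to split the work or to confirm the current limits with Scientific Computing.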

Default Resources

Any resource a task does not specify falls back to its default:

Cores: 1
Memory: 1GB
Runtime: 10 minutes
Queue: all.q
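The fallback behaviour can be pictured as a dictionary merge in which requested values override the defaults. The sketch below illustrates the idea only and is not Jobmon's actual resolution logic; the function name effective_resources is hypothetical, and the "1G" and "10m" spellings and the "queue" key are assumptions modelled on the other examples on this page.

```python
from typing import Any, Dict, Optional

# Defaults as listed above. The string formats ("1G", "10m") and the
# "queue" key are assumptions, not confirmed Jobmon syntax.
DEFAULT_RESOURCES: Dict[str, Any] = {
    "cores": 1,
    "memory": "1G",
    "runtime": "10m",
    "queue": "all.q",
}


def effective_resources(
    requested: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Overlay the requested resources on a copy of the defaults."""
    merged = dict(DEFAULT_RESOURCES)  # copy so the defaults stay untouched
    merged.update(requested or {})
    return merged


# Only memory is overridden; cores, runtime and queue keep their defaults.
print(effective_resources({"memory": "10G"}))
```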

Archive Nodes

To access /snfs1 (the J-drive), a task must run on an archive node. Request one with the "constraints" resource:

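# "template" is a task template created earlier; pass its usual
# task arguments alongside compute_resources.
task = template.create_task(
    compute_resources={
        "cores": 1,
        "memory": "10G",
        "runtime": "1h",
        "constraints": "archive",  # run on a node that can reach /snfs1
    }
)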

Projects

Every job must name a project for accounting. Set it in compute_resources:

compute_resources={
    "project": "proj_scicomp",
    # ... other resources
}

Contact your team lead for the correct project code.
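Because the project is required, it can help to fail early in your own code rather than at submission time. The check below is a minimal sketch; require_project is a hypothetical helper, not part of Jobmon, and it only verifies that a non-empty project string is present, not that the code is valid on the cluster.

```python
from typing import Any, Dict


def require_project(compute_resources: Dict[str, Any]) -> Dict[str, Any]:
    """Return the resources unchanged, or raise if no project is set."""
    project = compute_resources.get("project")
    if not isinstance(project, str) or not project.strip():
        raise ValueError(
            "compute_resources needs a 'project' entry, e.g. 'proj_scicomp'"
        )
    return compute_resources


resources = require_project(
    {"cores": 1, "memory": "10G", "runtime": "1h", "project": "proj_scicomp"}
)
print(resources["project"])  # proj_scicomp
```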

Other Distributors

For development and testing, you can use one of the following distributors instead of the Slurm cluster:

Multiprocess Distributor

Runs tasks locally using multiple CPU cores:

tool.set_default_cluster_name("multiprocess")

Sequential Distributor

Runs tasks one at a time (useful for debugging):

tool.set_default_cluster_name("sequential")

Dummy Distributor

Simulates job submission without actually running anything:

tool.set_default_cluster_name("dummy")
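One way to switch between these distributors without editing workflow code is to map a run mode to a cluster name. This is a sketch under assumptions: the mode names and the cluster_name_for helper are hypothetical, and only the cluster names themselves come from this page.

```python
# Hypothetical run modes mapped to the cluster names documented above.
CLUSTER_BY_MODE = {
    "production": "slurm",
    "local": "multiprocess",
    "debug": "sequential",
    "dry-run": "dummy",
}


def cluster_name_for(mode: str) -> str:
    """Translate a run mode into a cluster name, rejecting unknown modes."""
    try:
        return CLUSTER_BY_MODE[mode]
    except KeyError:
        valid = ", ".join(sorted(CLUSTER_BY_MODE))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {valid}") from None


print(cluster_name_for("debug"))  # sequential
```

The result can then be passed straight to the tool, for example tool.set_default_cluster_name(cluster_name_for("debug")).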

Troubleshooting

Job Won’t Submit

Check that your project code is valid
Verify you have access to the requested queue
Ensure resource requests are within queue limits

Jobs Pending Too Long

Check cluster utilization with squeue
Consider using a different queue
Reduce resource requests if possible
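When jobs sit in the pending state, the reason Slurm reports for each one often shows what to change. The sketch below builds a squeue command for your pending jobs and tallies the reasons. The helper names are hypothetical; the flags used (-h, -u, -t, -o with %i, %P, %r) are standard squeue options, and %P prints the Slurm partition, assumed here to correspond to the queue.

```python
import getpass
import subprocess
from collections import Counter
from typing import List


def squeue_pending_command(user: str) -> List[str]:
    """Build a squeue call listing one user's pending jobs."""
    # -h: no header, -u: filter by user, -t: filter by job state,
    # -o: output format (%i job ID, %P partition, %r pending reason).
    return ["squeue", "-h", "-u", user, "-t", "PENDING", "-o", "%i|%P|%r"]


def count_pending_reasons(squeue_output: str) -> Counter:
    """Count how many pending jobs report each reason."""
    reasons: Counter = Counter()
    for line in squeue_output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3:  # skip blank or malformed lines
            reasons[parts[2]] += 1
    return reasons


def pending_reasons_for_current_user() -> Counter:
    """Run squeue for the current user; requires squeue on the PATH."""
    cmd = squeue_pending_command(getpass.getuser())
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return count_pending_reasons(result.stdout)


# Parsing demonstrated on made-up sample output, so no cluster is needed.
sample = "101|all.q|Priority\n102|all.q|Resources\n103|long.q|Priority\n"
print(count_pending_reasons(sample))
```

On a cluster login node, calling pending_reasons_for_current_user() gives the same tally for your real jobs. Many jobs waiting on Resources suggests requesting less memory or fewer cores, while Priority means other jobs are ahead of yours in the queue.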

For additional help, see IHME Support.