IHME Clusters
This page documents the compute clusters available at IHME for running Jobmon workflows.
Slurm Cluster
IHME’s primary compute cluster runs Slurm. This is the default cluster for all Jobmon workflows.
Cluster Name
To select this cluster in your code, set it as the tool's default cluster name:
from jobmon.client.tool import Tool

tool = Tool(name="my_tool")
tool.set_default_cluster_name("slurm")
Available Queues
| Queue | Max Runtime | Max Memory | Use Case |
|---|---|---|---|
| all.q | 3 days | 750GB | General purpose jobs |
| long.q | 16 days | 750GB | Long-running jobs |
| d.q | 24 hours | 1TB | High-memory jobs |
Note
Queue limits may change. Check with Scientific Computing for current values.
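As a rough illustration of how the limits in the table above might be used when choosing a queue, the sketch below lists the queues that can accommodate a given request. The `QUEUE_LIMITS` dictionary and the `queues_that_fit` helper are hypothetical names, not part of Jobmon. The values are copied from the table and may be out of date, and 1TB is assumed to mean 1000GB.

```python
# Hypothetical helper, not part of Jobmon. Limits are copied from the
# table above and may be out of date; 1TB is assumed to be 1000GB.
QUEUE_LIMITS = {
    "all.q": {"max_runtime_hours": 3 * 24, "max_memory_gb": 750},
    "long.q": {"max_runtime_hours": 16 * 24, "max_memory_gb": 750},
    "d.q": {"max_runtime_hours": 24, "max_memory_gb": 1000},
}


def queues_that_fit(runtime_hours, memory_gb):
    """Return the names of queues whose limits cover the request."""
    return [
        name
        for name, limits in QUEUE_LIMITS.items()
        if runtime_hours <= limits["max_runtime_hours"]
        and memory_gb <= limits["max_memory_gb"]
    ]


# A 48-hour, 100GB job is too long for d.q but fits the other two queues.
print(queues_that_fit(runtime_hours=48, memory_gb=100))  # ['all.q', 'long.q']

# A 12-hour, 900GB job exceeds the 750GB limit everywhere except d.q.
print(queues_that_fit(runtime_hours=12, memory_gb=900))  # ['d.q']
```

A request that returns an empty list exceeds every queue's limits as listed here and would need to be reduced or split.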
Default Resources
If not specified, tasks use these defaults:
Cores: 1
Memory: 1GB
Runtime: 10 minutes
Queue: all.q
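To make the effect of these defaults concrete, the following sketch shows which values would be in effect when a task overrides only some of them. It is plain Python and does not call Jobmon, which applies its own defaults internally. The `DEFAULT_RESOURCES` dictionary and the `with_defaults` helper are hypothetical, and the string formats `"1G"` and `"10m"` are assumed by analogy with the `"10G"` and `"1h"` values used elsewhere on this page.

```python
# Hypothetical illustration only; Jobmon applies its own defaults.
# The "1G" and "10m" spellings are assumed formats for 1GB and 10 minutes.
DEFAULT_RESOURCES = {
    "cores": 1,
    "memory": "1G",
    "runtime": "10m",
    "queue": "all.q",
}


def with_defaults(overrides=None):
    """Return a new dict: the defaults, updated with any overrides."""
    merged = dict(DEFAULT_RESOURCES)  # copy so the defaults stay untouched
    merged.update(overrides or {})
    return merged


# Only memory and runtime are overridden; cores and queue keep their defaults.
print(with_defaults({"memory": "10G", "runtime": "1h"}))
```

Spelling out every resource explicitly in `compute_resources` avoids relying on the defaults and guards against them changing.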
Archive Nodes
To access /snfs1 (the J-drive), request an archive node:
task = template.create_task(
    compute_resources={
        "cores": 1,
        "memory": "10G",
        "runtime": "1h",
        # Request an archive node so that /snfs1 is accessible
        "constraints": "archive",
    }
)
Projects
You must specify a project for accounting:
compute_resources={
    "project": "proj_scicomp",
    # ... other resources
}
Contact your team lead for the correct project code.
Other Distributors
For development and testing, you can also use:
Multiprocess Distributor
Runs tasks locally using multiple CPU cores:
tool.set_default_cluster_name("multiprocess")
Sequential Distributor
Runs tasks one at a time (useful for debugging):
tool.set_default_cluster_name("sequential")
Dummy Distributor
Simulates job submission without actually running anything:
tool.set_default_cluster_name("dummy")
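One possible way to switch between these distributors without editing code is to read the cluster name from an environment variable. The sketch below assumes a hypothetical variable named `MY_WORKFLOW_CLUSTER` and a hypothetical helper `pick_cluster_name`. Neither is part of Jobmon, and the list of valid names simply mirrors the clusters mentioned on this page.

```python
import os

# Cluster names mentioned on this page.
KNOWN_CLUSTERS = ("slurm", "multiprocess", "sequential", "dummy")


def pick_cluster_name(env=None, default="slurm"):
    """Return the cluster name from MY_WORKFLOW_CLUSTER (hypothetical), or the default."""
    env = os.environ if env is None else env
    name = env.get("MY_WORKFLOW_CLUSTER", default)
    if name not in KNOWN_CLUSTERS:
        raise ValueError(f"Unknown cluster name: {name!r}")
    return name


# With the variable unset, the Slurm cluster is used.
print(pick_cluster_name(env={}))  # slurm

# Setting the variable selects a local distributor for debugging.
print(pick_cluster_name(env={"MY_WORKFLOW_CLUSTER": "sequential"}))  # sequential
```

The returned name could then be passed to `tool.set_default_cluster_name(...)` as in the snippets above.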
Troubleshooting
Job Won’t Submit
Check that your project code is valid
Verify you have access to the requested queue
Ensure resource requests are within queue limits
Jobs Pending Too Long
Check cluster utilization with squeue
Consider using a different queue
Reduce resource requests if possible
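The squeue check above can also be scripted. The following is a minimal sketch, assuming it runs on a host where the Slurm `squeue` command is on the `PATH`. The helper names `pending_jobs_command` and `list_pending_jobs` are hypothetical; the `--user` and `--states` options are standard `squeue` flags.

```python
import getpass
import shutil
import subprocess


def pending_jobs_command(user):
    """Build the squeue command that lists one user's pending jobs."""
    return ["squeue", f"--user={user}", "--states=PENDING"]


def list_pending_jobs(user=None):
    """Return squeue's text output, or None if squeue is not installed."""
    user = user or getpass.getuser()
    if shutil.which("squeue") is None:
        return None  # not on a Slurm submit host
    result = subprocess.run(
        pending_jobs_command(user),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if squeue itself fails
    )
    return result.stdout
```

Calling `list_pending_jobs()` on a submit host returns the same listing that `squeue` prints for the current user, filtered to pending jobs.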
For additional help, see IHME Support.