************** IHME Clusters ************** This page documents the compute clusters available at IHME for running Jobmon workflows. Slurm Cluster ============= IHME's primary compute cluster runs Slurm. This is the default cluster for all Jobmon workflows. Cluster Name ------------ When specifying the cluster in your code: .. code-block:: python tool = Tool(name="my_tool") tool.set_default_cluster_name("slurm") Available Queues ---------------- .. list-table:: :header-rows: 1 :widths: 20 20 20 40 * - Queue - Max Runtime - Max Memory - Use Case * - all.q - 3 days - 750GB - General purpose jobs * - long.q - 16 days - 750GB - Long-running jobs * - d.q - 24 hours - 1TB - High-memory jobs .. note:: Queue limits may change. Check with Scientific Computing for current values. Default Resources ----------------- If not specified, tasks use these defaults: - **Cores**: 1 - **Memory**: 1GB - **Runtime**: 10 minutes - **Queue**: all.q Archive Nodes ------------- To access ``/snfs1`` (the J-drive), request an archive node: .. code-block:: python task = template.create_task( compute_resources={ "cores": 1, "memory": "10G", "runtime": "1h", "constraints": "archive" } ) Projects -------- You must specify a project for accounting: .. code-block:: python compute_resources={ "project": "proj_scicomp", # ... other resources } Contact your team lead for the correct project code. Other Distributors ================== For development and testing, you can also use: Multiprocess Distributor ------------------------ Runs tasks locally using multiple CPU cores: .. code-block:: python tool.set_default_cluster_name("multiprocess") Sequential Distributor ---------------------- Runs tasks one at a time (useful for debugging): .. code-block:: python tool.set_default_cluster_name("sequential") Dummy Distributor ----------------- Simulates job submission without actually running anything: .. code-block:: python tool.set_default_cluster_name("dummy") Troubleshooting =============== Job Won't Submit ---------------- 1. Check your project code is valid 2. Verify you have access to the requested queue 3. Ensure resource requests are within queue limits Jobs Pending Too Long --------------------- 1. Check cluster utilization with ``squeue`` 2. Consider using a different queue 3. Reduce resource requests if possible For additional help, see :doc:`support`.