H200 Partition

The H200 partition comprises 8 nodes with 70 H200 GPUs. Access to this partition is limited and is available only by direct request from a faculty PI. To request general access, please fill out this Qualtrics survey.

Note

Please note that the early access partition and account, h200ea, will be retired soon.

There are two H200 partitions available on the DCC.

  1. General access, preemptible: scavenger-h200
  2. High-priority access: h200-hp

Submitting jobs to the scavenger-h200 partition

The general access scavenger-h200 partition has a limit of 2 GPUs per user and a 24-hour maximum walltime, as set by the H200 governance committee. Jobs on this partition may be preempted by higher-priority jobs, in which case they are cancelled and requeued. Checkpoint your jobs regularly so that progress is not lost.

Slurm settings:

#SBATCH -A scavenger-h200
#SBATCH -p scavenger-h200
#SBATCH --gres=gpu:h200:N #N is the number of GPUs

For Open OnDemand sessions, specify the Account as scavenger-h200 and Partition as scavenger-h200.

All H200 jobs will need to specify the “gres” with the “h200” GPU type, e.g.

#SBATCH --gres=gpu:h200:1

This is to accommodate the MIG settings on dcc-h200-gpu-05.

The default walltime on the scavenger-h200 partition is 30 minutes. To request a different walltime, set the following Slurm parameter,

#SBATCH --time=10:00:00

in your job submission script. The maximum is 24 hours for this partition.
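The settings in this section can be combined into a preemption-tolerant job script. The following is a minimal sketch, not an official DCC template: the checkpoint directory, the latest_checkpoint helper, and the train.py program with its --checkpoint-dir and --resume flags are hypothetical placeholders for your own workload. --requeue and --open-mode=append are standard sbatch options; whether your program can actually resume depends on it writing its own checkpoints.

```shell
#!/bin/bash
#SBATCH -A scavenger-h200
#SBATCH -p scavenger-h200
#SBATCH --gres=gpu:h200:1
#SBATCH --time=10:00:00
#SBATCH --requeue              # make the requeue-after-preemption request explicit
#SBATCH --open-mode=append     # keep appending to the same log across restarts

# Hypothetical checkpoint location; adjust to your own storage.
CKPT_DIR="${CKPT_DIR:-$PWD/checkpoints}"

# Print the most recently modified *.ckpt file in a directory,
# or nothing if no checkpoint exists yet.
latest_checkpoint() {
    ls -1t "$1"/*.ckpt 2>/dev/null | head -n 1
}

# Only launch the workload when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    mkdir -p "$CKPT_DIR"
    ckpt="$(latest_checkpoint "$CKPT_DIR")"
    if [ -n "$ckpt" ]; then
        # train.py and its flags are assumptions standing in for your program.
        python train.py --checkpoint-dir "$CKPT_DIR" --resume "$ckpt"
    else
        python train.py --checkpoint-dir "$CKPT_DIR"
    fi
fi
```

On the first run no checkpoint exists, so the job starts from scratch; after a preemption and requeue, the same script picks up the newest checkpoint file instead.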

Submitting jobs to the h200-hp partition

The high-priority, non-preempted h200-hp partition is reserved for select lab groups. The partition can be accessed through two accounts:

  1. <labname>_h200 account: unrestricted account with a limit of 8 concurrent GPUs per user and a 7-day max walltime
  2. <labname>_h200_r account: restricted account with a limit of 2 concurrent GPUs per user and a 1-day max walltime

Both of these accounts are linked to a single QoS, <labname>_h200, to track billing. Each lab is allocated a weekly quota of GPU-minutes (1 GPU-minute = one job using 1 GPU for 1 minute), which resets at the start of each week (Monday at 00:00 ET).
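As a rough illustration of the GPU-minute definition, the sketch below multiplies the number of GPUs by the runtime in minutes. The gpu_minutes function is a hypothetical helper, not a DCC tool, and it assumes the quota is charged on the time a job actually runs; check with the DCC team if you need the exact accounting rule.

```shell
# Estimate the GPU-minutes a job would consume:
# number of GPUs x runtime in minutes.
gpu_minutes() {
    echo $(( $1 * $2 ))
}

gpu_minutes 2 90     # 2 GPUs for 1.5 hours -> prints 180
gpu_minutes 8 1440   # 8 GPUs for a full day -> prints 11520
```

Comparing such an estimate against the lab's remaining quota before submitting a long multi-GPU job can help avoid exhausting the weekly allocation early.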

You may check your lab's quota with the following command from a DCC login node,

get_h200_usage.sh <labname>_h200

E.g., for a lab group rescomp,

$ get_h200_usage.sh rescomp_h200
QoS                  | Billing Minutes      | Used                 | Remaining           
---------------------+----------------------+----------------------+---------------------
rescomp_h200         | 120                  | 84                   | 36       
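If you want to use the remaining quota in a script, the pipe-separated output can be parsed with awk. This is a sketch that assumes the column layout of the sample output above stays the same; h200_remaining is a hypothetical helper name, not something installed on the DCC.

```shell
# Print the "Remaining" column for one QoS from get_h200_usage.sh output
# read on stdin. Assumes the pipe-separated layout of the sample output.
h200_remaining() {
    awk -F'|' -v qos="$1" '
        { name = $1; gsub(/[ \t]/, "", name) }
        name == qos { rem = $4; gsub(/[ \t]/, "", rem); print rem }
    '
}

# Demonstration on the sample output; prints 36.
h200_remaining rescomp_h200 <<'EOF'
QoS                  | Billing Minutes      | Used                 | Remaining
---------------------+----------------------+----------------------+---------------------
rescomp_h200         | 120                  | 84                   | 36
EOF
```

On a login node, the equivalent would be to pipe the real command into the helper: get_h200_usage.sh rescomp_h200 | h200_remaining rescomp_h200.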

Slurm settings:

#SBATCH -A <labname>_h200 #or <labname>_h200_r
#SBATCH -p h200-hp
#SBATCH --gres=gpu:h200:N #N is the number of GPUs

For Open OnDemand sessions, specify the Account as <labname>_h200 or <labname>_h200_r and Partition as h200-hp.

All H200 jobs will need to specify the “gres” with the “h200” GPU type, e.g.

#SBATCH --gres=gpu:h200:1

This is to accommodate the MIG settings on dcc-h200-gpu-05.

The default walltime on the h200-hp partition is 1 hour. To request a different walltime, set the following Slurm parameter,

#SBATCH --time=10:00:00

in your job submission script. The maximum is 7 days for this partition.
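For walltimes longer than a day, sbatch also accepts the days-hours:minutes:seconds form. The sketch below converts a duration in whole minutes into that form; slurm_time is a hypothetical helper for illustration only.

```shell
# Format a duration given in whole minutes as a D-HH:MM:SS string
# suitable for "#SBATCH --time=...".
slurm_time() {
    printf '%d-%02d:%02d:00\n' $(( $1 / 1440 )) $(( ($1 % 1440) / 60 )) $(( $1 % 60 ))
}

slurm_time 600     # 10 hours -> prints 0-10:00:00
slurm_time 10080   # 7 days, the h200-hp maximum -> prints 7-00:00:00
```

For example, the 7-day maximum on this partition would be requested with #SBATCH --time=7-00:00:00.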

Measuring GPU Efficiency

As the H200 GPUs are a limited shared resource, we encourage users to be mindful of their GPU usage and to use the resources efficiently. We will be sharing weekly usage reports with the goal of helping users better understand their GPU usage patterns and to support more efficient and effective use of the partition.

Users may also use the slurm-gpu tool developed by Joe Shamblin to measure the GPU efficiency and GPU memory efficiency of their jobs.

  • The GPU efficiency (GPUEff) represents the percentage of time GPU compute resources were actively engaged as reported by nvidia-smi.
  • The GPU memory efficiency (GPUMemEff) represents the percentage of GPU memory (an H200 GPU has a total of 141 GB of VRAM) that was actively used during the job.

From a login node, run the following command to check the GPU efficiency of your jobs for a specified time range,

slurm-gpu report -r <partition> -S YYYY-MM-DD -E YYYY-MM-DD -u ${USER}

E.g.,

$ slurm-gpu report -r scavenger-h200 -S 2026-03-10 -E 2026-03-20 -u ${USER}
┌──────┬────────────┬──────────────┬──────────┬─────────┬────────┬────────┬────────┬─────────┬───────────┬────────┬────────────────┐
│ User │ JobID      │ State        │  Elapsed │ TimeEff │ CPUEff │ MemEff │ GPUEff │ GPUUtil │ GPUMemEff │ GPUMem │ Partition      │
╞══════╪════════════╪══════════════╪══════════╪═════════╪════════╪════════╪════════╪═════════╪═══════════╪════════╪════════════════╡
│ ukh  │ 44235552_0 │ COMPLETED    │ 00:10:53 │   90.7% │   2.1% │  60.0% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_1 │ COMPLETED    │ 00:10:53 │   90.7% │   2.1% │  60.7% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_2 │ COMPLETED    │ 00:10:53 │   90.7% │   2.1% │  59.8% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_3 │ COMPLETED    │ 00:10:53 │   90.7% │   2.1% │  60.3% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_4 │ COMPLETED    │ 00:10:53 │   90.7% │   2.1% │  60.0% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_5 │ COMPLETED    │ 00:10:47 │   89.9% │   2.0% │  61.2% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_6 │ COMPLETED    │ 00:10:48 │   90.0% │   3.5% │  61.8% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_7 │ COMPLETED    │ 00:10:42 │   89.2% │   2.0% │  64.9% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_8 │ COMPLETED    │ 00:10:47 │   89.9% │   2.0% │  63.2% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│ ukh  │ 44235552_9 │ COMPLETED    │ 00:10:41 │   89.0% │   1.9% │  66.7% │ 100.0% │    100% │     89.7% │ 125.6G │ scavenger-h200 │
│      │            │ WEIGHTED AVG │ 01:48:10 │         │   2.2% │  61.8% │ 100.0% │  100.0% │     89.7% │    --- │                │
└──────┴────────────┴──────────────┴──────────┴─────────┴────────┴────────┴────────┴─────────┴───────────┴────────┴────────────────┘

The last row of the output table shows the time-weighted average GPU efficiency and GPU memory efficiency across all your jobs in the specified time range. The time-weighted GPU efficiency is computed as,

\[ \text{Time-weighted GPU Efficiency} = \frac{\sum_i \left( \text{GPUEff}_i \times \text{time}_i \right)}{\sum_i \text{time}_i} \]

where \(i\) is the index over your jobs, \(\text{GPUEff}_i\) is the GPU efficiency of job \(i\), and \(\text{time}_i\) is the elapsed time of job \(i\). The same formula, with \(\text{GPUMemEff}_i\) in place of \(\text{GPUEff}_i\), gives the time-weighted GPU memory efficiency.
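To make the weighting concrete, the sketch below evaluates the formula for two hypothetical jobs. The time_weighted_avg helper and the input format (one job per line, efficiency in percent followed by elapsed seconds) are assumptions for illustration and are not part of slurm-gpu.

```shell
# Time-weighted average of per-job efficiencies.
# Input on stdin: one job per line as "<efficiency_percent> <elapsed_seconds>".
time_weighted_avg() {
    awk '{ num += $1 * $2; den += $2 }
         END { if (den > 0) printf "%.1f\n", num / den }'
}

# Two hypothetical jobs: 100% for 10 minutes and 50% for 30 minutes.
printf '100 600\n50 1800\n' | time_weighted_avg   # prints 62.5
```

The result is 62.5% rather than the unweighted mean of 75%, because the longer job counts three times as much as the shorter one.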

Fractionalized GPUs

We have enabled MIG (Multi-Instance GPU) on some of the GPUs within the scavenger-h200 partition. Two GPUs on dcc-h200-gpu-05 (GPUs 0 and 1) have been fractionalized, one into 7 parts and one into 2 parts. Fractional GPUs can be requested with:

--gres=gpu:h200_1g.18gb:1 # Up to 7 MIG GPUs
--gres=gpu:h200_4g.71gb:1 # Up to 1 MIG GPU
--gres=gpu:h200_3g.71gb:1 # Up to 1 MIG GPU
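A complete job script requesting one of the small MIG slices might look like the following sketch. The account, partition, and gres values are taken from this page; the one-hour walltime and the body of the script are illustrative assumptions.

```shell
#!/bin/bash
#SBATCH -A scavenger-h200
#SBATCH -p scavenger-h200
#SBATCH --gres=gpu:h200_1g.18gb:1   # one 1g.18gb MIG slice
#SBATCH --time=01:00:00

# List the GPU devices visible to the job, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi found, but no GPU is visible"
fi
```

A small slice like this can be a good fit for debugging or light inference, leaving full H200 GPUs free for jobs that need the whole device.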