Overview
The Duke Compute Cluster (DCC) is a general-purpose high-performance and high-throughput computing installation, fitted with software used for a broad array of scientific projects. With a few notable exceptions, applications on the cluster are Free and Open Source Software.
Quick facts:
- 1,360 nodes with a combined total of more than 45,000 vCPUs, 980 GPUs, and 270 TB of RAM
- Interconnects are 10 Gbps or 40 Gbps
- Utilizes a 7 Petabyte Isilon file system
- Runs AlmaLinux 9, with SLURM as the job scheduler
The DCC is managed and supported by Research Computing.
The Duke Compute Cluster consists of machines that the University has provided for community use and machines that researchers have purchased to conduct their research. The equipment is housed in enterprise-grade data centers on Duke’s West Campus. Cluster hardware is heterogeneous, varying with installation date and the standards current at the time of purchase.
General users have access to common nodes purchased by the University and low-priority access to researcher-purchased nodes. Researchers who have provided equipment have high-priority access to their own nodes in addition to the general access. Low-priority consumption of cycles greatly increases the overall efficiency of the cluster, while also giving all users access to more than their own nodes’ cycles when they need them. Low-priority jobs on the machines yield to high-priority jobs.
Cluster Appropriate Use
Users of the cluster agree to only run jobs that relate to the research mission of Duke University. Use of the cluster for the following activities is prohibited:
- Financial gain
- Commercial or business use
- Unauthorized use or storage of copyright-protected or proprietary resources
- Unauthorized mining of data on or off campus (including many web scraping techniques)
Data Security and Privacy
Users of the cluster are responsible for the data they introduce to the cluster and must follow all applicable Duke (including IRB), school, and departmental policies on data management and data use. Security and compliance provisions on the cluster are sufficient to meet the Duke data classification standard for public or restricted data. Use of sensitive data (e.g. legally protected data such as PHI or FERPA) or data bound by certain restrictions in data use agreements is not allowed. Data that has been appropriately de-identified or obfuscated potentially may be introduced to the cluster without violating data use agreements or government regulations.
Because the cluster is a shared resource, privacy on it is constrained, and users must conduct themselves in a manner that respects other researchers’ privacy. Cluster support staff have access to all data on the cluster and may inspect elements of the system from time to time. Metadata on the cluster and utilization by group (and sometimes by user) will be made available to all cluster users and Duke stakeholders.
Use of Shared Resources
Cluster users work in a shared environment and must adhere to usage best practices to ensure good performance for all cluster users.
Computational Work (Jobs) on Shared Resources
- All computational work should be submitted to the cluster through the job scheduler (SLURM). Running jobs on the login nodes is an abuse of the system.
- Common partition resources should be used judiciously. Groups with sustained needs should purchase nodes for high-priority access.
- Use of scavenger partitions is encouraged for bursting, large workloads, and other short-term needs.
- Use of long-running jobs on common and scavenger partitions is discouraged, both for fairness to other users and because node failures and scheduled maintenance may require interruption of processes.
- Checkpointing is good computing practice for long-running jobs on all partitions.
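As an illustration of the checkpointing practice mentioned above, the following is a minimal Python sketch rather than a DCC-provided tool. The function names, the JSON state layout, and the `stop_after` parameter (which only simulates an interruption) are all hypothetical; a real job would save whatever state its own computation needs.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the checkpoint.

    The rename is atomic on POSIX file systems, so an interrupted write
    never leaves a half-written checkpoint behind.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = Path(path)
    if path.exists():
        with path.open() as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `every` steps.

    `stop_after` is a hypothetical knob that simulates the job being
    interrupted at that step, without a final save.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step == stop_after:
            return state  # simulated interruption
        state["total"] += step  # stand-in for one unit of real work
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job is interrupted and resubmitted, calling `run` again with the same checkpoint path resumes from the last saved step instead of starting over. The checkpoint interval is a trade-off: saving more often loses less work on interruption but adds more I/O on shared storage.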
Cluster Shared Storage Resources (/cwork, /work and /scratch)
DCC storage is designed and optimized for very large data under computation, not for long-term data storage. Labs requiring long-term data storage may upgrade their group storage or add additional storage at a cost; see our pricing.
In order to keep processing overhead low and operations fast on shared storage, there are no backups, and no logging of usage actions. Since these areas are susceptible to data loss, users of the cluster should retain a copy of their irreplaceable data at a separate location and they should remove results from shared space frequently.
Capacity is at a premium and users should clean up and remove their own data at the conclusion of their computation. Additionally, to prevent shared volumes from filling up, files older than 75 days on /work will be purged on the 1st and 15th of every month. Notifications will not be sent. Touching files to expressly avoid the purge process is prohibited. If storage utilization reaches levels that could impact users, the following procedure will be used:
- If utilization exceeds 80%, a notice will be sent to top storage users advising them that the volume is approaching capacity and asking them to save essential results to lab storage and delete the files whose removal would least affect ongoing work.
- If utilization exceeds 90%, files from the notified top storage users will be purged until utilization is back at 80%.
- If the above efforts do not succeed in reducing utilization, a general purge will be run off cycle, with decreasing age of files as needed. Notifications will be sent to all /work users.
Users who require exceptional use of /work (>20TB for more than 1 week) must notify rescomputing@duke.edu. Purge practices will change over time based on the needs of managing the cluster.
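To help with cleanup ahead of the 75-day purge described above, the sketch below lists files under a directory that are older than a given age so they can be copied to lab storage or deleted. It assumes that file age is judged by modification time, which the policy does not state, and the function name is hypothetical. It is intended for finding results to move or remove, not for altering timestamps.

```python
import os
import sys
import time
from pathlib import Path

PURGE_AGE_DAYS = 75  # age threshold quoted in the policy text
SECONDS_PER_DAY = 86400


def files_older_than(root, days=PURGE_AGE_DAYS, now=None):
    """Yield (path, age_in_days) for files under root older than `days`.

    Assumption: age is measured from the modification time (mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path, (now - mtime) / SECONDS_PER_DAY


if __name__ == "__main__":
    # Usage: python old_files.py <directory>
    for path, age in files_older_than(sys.argv[1]):
        print(f"{age:7.1f} days  {path}")
```

Walking a very large directory tree generates many metadata operations on shared storage, so a scan like this is best pointed at a user's own subdirectory rather than at an entire volume.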
AI Use on the Duke Compute Cluster
The Duke Compute Cluster is a shared research computing environment. Use of AI tools on the DCC must be consistent with Duke’s research, instructional, and administrative purposes; must respect data classification, privacy, and access requirements; and must not interfere with the availability, integrity, or security of DCC systems or other users’ work. Duke computing resources are provided for authorized users and approved purposes, and use of those resources remains subject to Duke’s computing and acceptable-use policies, including Duke’s Use of Computing & Electronic Resources Policy.
AI-assisted development and research workflows may be appropriate on the DCC when they are run transparently, within assigned resource allocations, and with normal user oversight. Users should avoid running autonomous or agentic AI tools in ways that grant broad, unattended control over files, shell commands, network access, credentials, job submission, or other DCC resources.
OpenClaw is not permitted on the Duke Compute Cluster. Users must not install, run, expose, proxy, or connect OpenClaw services from DCC login nodes, compute nodes, interactive sessions, OnDemand environments, or related DCC storage locations.
Users are also strongly discouraged from using AI tools with unsafe permission-bypass modes, including flags or settings such as --dangerously-skip-permissions, bypassPermissions, YOLO mode, blanket tool approval, unrestricted shell execution, disabled approval prompts, or similar configurations. Such modes are especially risky in a shared HPC environment because an AI agent may act with the same access as the user account, including access to files, credentials, job queues, network resources, and shared research data. Anthropic’s Claude Code permission modes documentation describes bypassPermissions as disabling permission prompts and safety checks so tool calls execute immediately, and states that --dangerously-skip-permissions is equivalent.
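As a small illustration, a lab could screen wrapper scripts or launch commands for the two literal settings named above before running an agent. The sketch below is a coarse substring check under that assumption; the function name is hypothetical, it does not cover the other modes listed (such as YOLO mode or blanket tool approval), and it is not a security control.

```python
# Literal strings taken from the policy text above.
RISKY_MARKERS = ("--dangerously-skip-permissions", "bypassPermissions")


def risky_markers_in(args):
    """Return the risky markers that appear in a list of command-line arguments."""
    found = []
    for marker in RISKY_MARKERS:
        if any(marker in arg for arg in args):
            found.append(marker)
    return found


if __name__ == "__main__":
    import sys

    hits = risky_markers_in(sys.argv[1:])
    if hits:
        print("Refusing to continue; permission-bypass setting found:", ", ".join(hits))
        sys.exit(1)
    print("No listed permission-bypass setting found.")
```

A check like this only catches settings passed on the command line; the same modes can also be enabled in configuration files, so it complements human review of an agent's setup rather than replacing it.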
Labs should be aware that, with the increasing use of AI coding agents, dangerous or poorly supervised AI use can create serious operational and data risks. If lab members use AI tools in unsafe modes, with overly broad permissions, or in ways that allow an agent to make destructive or large-scale changes, DCC or IT staff may not be able to help the lab fully recover from resulting damage. This may include loss or corruption of files, deletion or modification of code or research data, unintended job submissions, credential exposure, or changes to software environments. Labs remain responsible for maintaining appropriate backups, supervising AI-assisted workflows, and ensuring that lab members understand the risks before using AI coding agents on shared computing resources.
DCC will not allow the use of AI tools to negatively impact the performance, stability, availability, or security of the cluster for other users. Jobs, processes, agents, services, or accounts that degrade shared resources, overload systems or networks, consume excessive resources, create security concerns, or otherwise disrupt other users’ work may be limited, suspended, terminated, or referred for further review as appropriate. If DCC staff warn a user or lab about unsafe, disruptive, or noncompliant AI use and the behavior continues, the user’s DCC access may be suspended or revoked, and the user may be required to meet with DCC staff before access is restored.
Users remain responsible for activity initiated from or attributable to their Duke accounts. If AI use causes security concerns, policy violations, data exposure, excessive resource consumption, service disruption, unauthorized access, or other operational issues, the user of the account may be held accountable under applicable Duke policies and processes. This includes Duke’s Acceptable Use Policy.
The Duke Community Standard’s Computing and Electronic Communication policy states that users are responsible for activities on their user ID or originating from their system; must refrain from monopolizing systems, overloading networks, degrading services, or wasting computer time and other resources; and must not attempt to circumvent or subvert security measures.
For an example of an AI coding-agent workflow that uses appropriate guardrails on the DCC, users may refer to Research Computing’s guide for using Codex CLI or the VS Code Codex extension through Duke OnDemand Code Server.
The guide demonstrates the expected pattern for running AI coding tools in a controlled way: use an interactive OnDemand Code Server session, keep user approval in the loop, configure Codex with a read-only sandbox, require on-request approval, disable sandbox network access, and avoid writable roots by default. This example does not authorize permission-bypass modes, unrestricted agent behavior, or any use that disrupts the DCC or other users. Users and labs remain responsible for supervising AI-assisted workflows and for activity performed by AI tools under their accounts.
The practical expectation is simple: use AI tools on the DCC only in ways you can supervise, explain, and defend. Do not delegate unrestricted control of Duke computing resources to an AI agent. Use restrictive permission modes, keep human approval in the loop, run heavy work through interactive sessions or SLURM as appropriate, protect credentials and sensitive data, maintain backups of important work, and contact Research Computing or IT security staff before attempting higher-risk AI workflows.
Supporting Services
With great thanks, the DCC team has adopted some tools developed by other institutions.
Globus
Globus is a data management service frequently used in the research community to transfer or share large-scale research data. It is a non-profit service run by the University of Chicago that is available to all Duke users under a standard Globus subscription.
Data repositories, such as your laptop, campus compute resources, scientific instruments, or archival storage, are each connected to Globus as a node. Users and administrators can then configure permissions on the nodes so that transfers and sharing can be done by the appropriate Globus users.
At Duke, you can use Globus to transfer or share public and restricted research data between approved endpoints. It is a common method used to move large data between:
- Scientific instruments and lab storage
- Lab storage and Duke computational resources like the Duke Compute Cluster or HARDAC
- Laptops and shared computing infrastructure
- Duke storage and collaborators at peer institutions like UNC or external computing labs
Why use Globus?
Globus was specifically designed for transferring and sharing research data and offers several key benefits:
- Reliability: Data is transferred directly between nodes, and Globus automatically tunes performance, maintains security, validates correctness, and recovers from errors by restarting the transfer
- Unified Interface: All of your Globus nodes are visible in a single, easy-to-use web interface
- Remote Initiation: Transfers between any two nodes can be initiated from your laptop without the data traversing your laptop
- External Collaboration: Globus is used by many research institutions and lets users connect with their institutional credentials to transfer data within and between institutions
Globus may be used with public and restricted data at Duke. It may not be used with sensitive data at this time.
Open OnDemand
The DCC Open OnDemand portal is based on the NSF-funded open-source HPC portal created by the Ohio Supercomputer Center. The goal of Open OnDemand is to provide web access to HPC resources. To learn more about the project, visit openondemand.org. OSC has also created extensive user documentation, including videos, if you need more help than is available here.
The DCC implementation of OnDemand is focused on providing web-based interactive sessions on the DCC for applications that produce visualizations of data analysis completed on the DCC. OnDemand may also be used to run RStudio and Jupyter, though users are limited in their ability to install their own packages, so these services should be used cautiously.
The DCC OnDemand service uses SLURM to schedule and run interactive sessions on the DCC, so all OnDemand users should have basic familiarity with using the DCC.
Advanced Duke GPU Computing Resources
Duke Research Computing maintains state-of-the-art GPU capabilities (also offered under a condo model) within the Duke Compute Cluster (DCC), including 64 of the latest NVIDIA H200 Tensor Core GPUs configured to support AI research and large language model workloads. These H200 GPUs offer:
- 141 GB of HBM3e memory per GPU for handling the largest deep learning and HPC workloads.
- Up to 4.8 TB/s of memory bandwidth, enabling extremely fast data movement for AI model training and large-scale simulation.
- NVLink high-speed GPU interconnect supporting 900 GB/s bidirectional bandwidth per GPU, allowing multiple H200s in the same node to communicate as a single large memory space with near-zero CPU intervention.
- PCIe Gen5 connectivity for rapid communication with CPUs and storage subsystems.
- Full support for NVIDIA’s CUDA, cuDNN, and AI/ML acceleration libraries, enabling seamless scaling from single-GPU to multi-GPU workloads.
These GPUs are ideal for training foundation models, running large-scale graph analytics, performing genomics pipelines, and conducting physics-based simulations requiring high memory and bandwidth.
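To give a feel for what 141 GB per GPU means in practice, the sketch below estimates how many H200s are needed just to hold a model's weights. The parameter counts in the usage note are hypothetical examples, and the estimate covers weights only: activations, optimizer state, and inference caches add to the real requirement, often substantially.

```python
import math

H200_MEMORY_GB = 141  # per-GPU HBM3e capacity quoted above (decimal GB)
H200_COUNT = 64       # number of H200 GPUs quoted above


def weights_gb(n_params, bytes_per_param):
    """Memory needed for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def gpus_needed_for_weights(n_params, bytes_per_param, gpu_memory_gb=H200_MEMORY_GB):
    """Smallest number of GPUs whose combined memory holds the weights.

    This is a lower bound: it ignores activations, optimizer state,
    caches, and framework overhead.
    """
    return math.ceil(weights_gb(n_params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    # Hypothetical model sizes, stored at 2 bytes per parameter (16-bit).
    for n_params in (70_000_000_000, 405_000_000_000):
        gb = weights_gb(n_params, 2)
        print(f"{n_params:,} params: {gb:.0f} GB of weights, "
              f"at least {gpus_needed_for_weights(n_params, 2)} GPU(s)")
    print(f"Aggregate H200 memory: {H200_COUNT * H200_MEMORY_GB} GB")
```

Under these assumptions, a 70-billion-parameter model at 16-bit precision needs 140 GB for its weights, which only just fits on one H200 and leaves essentially no room for anything else, so multi-GPU placement would still be the practical choice.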