Overview
The Duke Compute Cluster is a general-purpose high-performance/high-throughput computing installation fitted with software used for a broad array of scientific projects. With a few notable exceptions, applications on the cluster are Free and Open Source Software.
Quick facts:
- 1,360 nodes that together provide more than 45,000 vCPUs, 980 GPUs, and 270 TB of RAM
- Interconnects of 10 Gbps or 40 Gbps
- A 7-petabyte Isilon file system
- AlmaLinux 9 operating system, with SLURM as the job scheduler
The DCC is managed and supported by Research Computing.
The Duke Compute Cluster consists of machines that the University has provided for community use and machines that researchers have purchased to conduct their research. The equipment is housed in enterprise-grade data centers on Duke’s West Campus. Cluster hardware is heterogeneous, varying by installation date and the standards in place at the time of purchase.
General users have access to common nodes purchased by the University and low-priority access to researcher-purchased nodes. Researchers who have provided equipment have high-priority access to their own nodes in addition to general access. Low-priority consumption of cycles greatly increases the overall efficiency of the cluster while also giving all users access to more cycles than their own nodes provide when they need them. Low-priority jobs yield to high-priority jobs.
Cluster Appropriate Use
Users of the cluster agree to only run jobs that relate to the research mission of Duke University. Use of the cluster for the following activities is prohibited:
- Financial gain
- Commercial or business use
- Unauthorized use or storage of copyright-protected or proprietary resources
- Unauthorized mining of data on or off campus (including many web scraping techniques)
Data Security and Privacy
Users of the cluster are responsible for the data they introduce to the cluster and must follow all applicable Duke (including IRB), school, and departmental policies on data management and data use. Security and compliance provisions on the cluster are sufficient to meet the Duke data classification standard for public or restricted data. Use of sensitive data (e.g., legally protected data such as PHI or FERPA-covered records) or data bound by certain restrictions in data use agreements is not allowed. Data that has been appropriately de-identified or obfuscated may potentially be introduced to the cluster without violating data use agreements or government regulations.
As a shared resource, privacy on the cluster is constrained and users of the cluster must conduct themselves in a manner that respects other researchers’ privacy. Cluster support staff have access to all data on the cluster and may inspect elements of the system from time to time. Metadata on the cluster and utilization by group (and sometimes user) will be made available to all cluster users and Duke stakeholders.
Use of Shared Resources
Cluster users work in a shared environment and must adhere to usage best practices to ensure good performance for all cluster users.
Computational Work (Jobs) on Shared Resources
All computational work should be submitted to the cluster through the job scheduler (SLURM); running jobs on the login nodes is an abuse of the system. Common partition resources should be used judiciously, and groups with sustained needs should purchase nodes for high-priority access. Use of scavenger partitions is encouraged for bursting, large workloads, and other short-term needs. Long-running jobs on the common and scavenger partitions are discouraged, both for fairness to other users and because node failures and scheduled maintenance may require interrupting processes. Checkpointing is good computing practice for long-running jobs on all partitions.
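For long-running work on any partition, periodic checkpointing lets a job resume after a node failure or a maintenance window instead of starting over. Below is a minimal, illustrative sketch in Python; the checkpoint file name, step count, and placeholder computation are assumptions for the example, not a DCC-provided tool.

```python
import os
import pickle

# Illustrative checkpoint file name; in practice put it somewhere in the
# job's working directory. This is an example, not a DCC-provided tool.
CHECKPOINT = "state.pkl"
TOTAL_STEPS = 1_000_000  # stand-in for the real workload size


def load_state():
    """Resume from an existing checkpoint, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0.0}


def save_state(state):
    """Write the checkpoint atomically so a mid-write failure is harmless."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)


state = load_state()
for step in range(state["step"], TOTAL_STEPS):
    state["result"] += step * 1e-6  # placeholder computation
    state["step"] = step + 1
    if state["step"] % 10_000 == 0:  # checkpoint every 10,000 steps
        save_state(state)
save_state(state)
```

A script along these lines would still be submitted through SLURM (for example with sbatch) rather than run on a login node.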
Cluster Shared Storage Resources (/cwork, /work and /scratch)
DCC storage is designed and optimized for very large data under active computation, not for long-term data storage. Labs requiring long-term data storage may upgrade their group storage or add additional storage at a cost; see our pricing.
To keep processing overhead low and operations fast on shared storage, there are no backups and no logging of usage actions. Because these areas are susceptible to data loss, cluster users should retain a copy of irreplaceable data in a separate location and should remove results from shared space frequently.
Capacity is at a premium, and users should clean up and remove their own data at the conclusion of their computation. Additionally, to prevent shared volumes from filling up, files older than 75 days on /work will be purged on the 1st and 15th of every month. Notifications will not be sent. Touching files to expressly avoid the purge process is prohibited. If storage utilization reaches levels that could impact users, the following procedure will be used:
- If utilization exceeds 80%, a notice will be sent to the top storage users advising that the volume is approaching capacity and asking them to save essential results to lab storage and delete the files least critical to ongoing work
- If utilization exceeds 90%, files from the notified top storage users will be purged until utilization is back to 80%
- If the above efforts do not reduce utilization, a general purge will be run off cycle, lowering the file-age threshold as needed; notifications will be sent to all /work users
Users who require exceptional use of /work (>20TB for more than 1 week) must notify rescomputing@duke.edu. Purge practices will change over time based on the needs of managing the cluster.
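Since files older than 75 days on /work are purged without notification, it can be worth reviewing what is aging in your own directory ahead of the 1st and 15th. The sketch below is only an illustration, assuming your files live under /work/<netid>; it lists candidates for cleanup and deliberately does not touch or modify anything, since resetting timestamps to dodge the purge is prohibited.

```python
import os
import time
from pathlib import Path

# Assumed location of your files on the shared volume (adjust as needed).
WORK_DIR = Path("/work") / os.environ.get("USER", "your_netid")
AGE_LIMIT_DAYS = 75
cutoff = time.time() - AGE_LIMIT_DAYS * 86400

# Print files whose modification time is older than the purge threshold.
# This only reads metadata; it does not touch or modify any files.
for path in WORK_DIR.rglob("*"):
    try:
        if path.is_file() and path.stat().st_mtime < cutoff:
            print(path)
    except OSError:
        # Files can vanish or be unreadable on a busy shared volume.
        continue
```

Anything still needed should be copied to lab or group storage and then removed from /work.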
Supporting Services
With great thanks, the DCC team has adopted some tools developed by other institutions.
Globus
Globus is a data management service frequently used in the research community to transfer or share large scale research data. It is a non-profit service run by the University of Chicago that is available to all Duke users under a standard Globus subscription.
Data repositories, such as your laptop, campus compute resources, scientific instruments, or archival storage, are connected to Globus as nodes. Users and administrators can then configure permissions on those nodes so that transfers and/or sharing can be performed by the appropriate Globus users.
At Duke, you can use Globus to transfer or share public and restricted research data between approved endpoints. It is a common method used to move large data between:
- Scientific instruments and lab storage
- Lab storage and Duke computational resources like the Duke Compute Cluster or HARDAC
- Laptops and shared computing infrastructure
- Duke storage and collaborators at peer institutions like UNC or external computing labs
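As one illustration of how a transfer is initiated, the hedged sketch below uses the Globus Python SDK (globus_sdk) to submit a transfer between two collections. The client ID, collection UUIDs, and paths are placeholders, and actual endpoint names and permissions depend on how your lab's nodes are configured in Globus.

```python
import globus_sdk

# Placeholder values: a client ID for a Globus "native app" you have
# registered, plus the UUIDs of the source and destination collections.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_COLLECTION = "SOURCE-COLLECTION-UUID"
DST_COLLECTION = "DESTINATION-COLLECTION-UUID"

# Interactive login: open the printed URL, authenticate, paste the code.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
auth_code = input("Authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Build and submit the transfer; Globus moves the data between the two
# collections directly and retries on errors.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
tdata = globus_sdk.TransferData(tc, SRC_COLLECTION, DST_COLLECTION,
                                label="example DCC transfer")
tdata.add_item("/path/on/source/dataset/", "/path/on/destination/dataset/",
               recursive=True)
task = tc.submit_transfer(tdata)
print("Submitted transfer, task ID:", task["task_id"])
```

Once submitted, the transfer runs between the two nodes themselves; the machine that initiated it only monitors progress.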
Why use Globus?
Globus was specifically designed for transferring and sharing research data and offers several key benefits:
- Reliability: Data is transferred directly between nodes, and Globus automatically tunes performance, maintains security, validates correctness, and recovers from errors to restart the transfer
- Unified Interface: All of your Globus nodes are visible in a single, easy-to-use web interface
- Remote Initiation: Transfers can be initiated from your laptop between any two nodes directly, without data traversing your laptop
- External Collaboration: Globus is used by many research institutions and allows users to connect with their institutional credentials to transfer data within and between institutions
Globus may be used with public and restricted data at Duke. It may not be used with sensitive data at this time.
Open OnDemand
The DCC Open OnDemand portal is based on the NSF-funded open-source HPC portal created by the Ohio Supercomputer Center. The goal of Open OnDemand is to provide web access to HPC resources. To learn more about the project, visit openondemand.org. OSC has also created extensive user documentation, including videos, if you need more help than is available here.
The DCC implementation of OnDemand is focused on providing web-based interactive sessions for applications that produce visualizations of data analysis completed on the DCC. OnDemand may also be used for RStudio and Jupyter, though there are limitations on users installing their own packages, so these services should be used with caution.
The DCC OnDemand service uses SLURM to schedule and run interactive sessions on the DCC, so all OnDemand users should have basic familiarity with using the DCC before relying on OnDemand.
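Because OnDemand sessions run as ordinary SLURM jobs, they can be inspected with the usual SLURM tools. The sketch below, assuming you are on a DCC login node with squeue on the PATH, simply lists your own jobs, including any OnDemand interactive sessions.

```python
import getpass
import subprocess

# List the current user's SLURM jobs; OnDemand interactive sessions show
# up here like any other job. Assumes squeue is available, as it is on
# DCC login nodes.
user = getpass.getuser()
result = subprocess.run(
    ["squeue", "-u", user, "--format", "%.18i %.12P %.20j %.8T %.10M"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```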