Partitions on Discovery Cluster

Partitions on Discovery Cluster for general users are based on 12 node groups, listed below. (Note that nodes compute-0-064 and compute-0-065, and nodes in the ranges compute-3-*, compute-4-* and higher, are owned by faculty and are not currently open for use in scavenging partitions. These nodes are in SLURM partitions reserved exclusively for the faculty that own them.)

  • ser-par-10g: This partition has nodes that have the 10 Gb/s TCP/IP backplane (no IB): compute-0-001 to compute-0-003 and compute-0-008 to compute-0-063. Each node has 128GB of RAM.
  • ser-par-10g-2: This partition has nodes compute-0-066 to compute-0-095 that have faster CPUs (Intel Xeon CPU E5-2680 2.8GHz). These nodes have 40 logical cores each, and 64GB of RAM. These nodes also have the 10 Gb/s TCP/IP backplane (no IB).
  • ser-par-10g-3: This partition has nodes compute-0-096 to compute-0-143 that have faster CPUs (Intel Xeon CPU E5-2680 2.8GHz). These nodes have 40 logical cores each, and 128GB of RAM. These nodes also have the 10 Gb/s TCP/IP backplane (no IB).
  • ser-par-10g-4: This partition has nodes compute-0-144 to compute-0-327 that have faster CPUs (Intel Xeon CPU E5-2690 v3 2.6GHz). These nodes have 48 logical cores each, and 128GB of RAM. These nodes also have the 10 Gb/s TCP/IP backplane (no IB).
  • parallel-ib: This partition has nodes that have the 10 Gb/s TCP/IP backplane and the FDR 56 Gb/s RDMA backplane: compute-1-064 to compute-1-127. Each node has 128GB of RAM.
  • interactive-10g: This partition has nodes compute-0-000, compute-0-001, compute-0-002 and compute-0-003 that users can use for interactive work. Users can request interactive nodes here via SLURM with 1 or more cores, up to a maximum of 16 cores per node (see the example below this list).
  • interactive-ib: This partition has nodes compute-1-064, compute-1-065, compute-1-066 and compute-1-067 that users can use for interactive work. Users can request interactive nodes here via SLURM with 1 or more cores, up to a maximum of 16 cores per node.
  • ht-10g: This partition has nodes compute-0-004, compute-0-005, compute-0-006 and compute-0-007 that users can use for jobs that run on the 10 Gb/s backplane only but benefit from Intel Hyper-Threading (HT) and Intel Turbo Boost. These nodes have 32 logical cores each, as opposed to the 16 logical cores on the other compute nodes, where HT is turned off.
  • par-gpu: This partition has nodes compute-2-128 to compute-2-159, each with an NVIDIA Tesla K20m GPU. These nodes have 32 logical cores each, and each GPU has 2496 CUDA cores. Each node has 128GB of RAM.
  • par-gpu-2: This partition has nodes compute-2-160 to compute-2-175, each with an NVIDIA Tesla K40m GPU. These nodes have 48 logical cores each, and each GPU has 2880 CUDA cores. Each node has 128GB of RAM.
  • largemem-10g: This partition has nodes compute-2-000 to compute-2-003 that have large memory: 384GB RAM, a 2TB swap file, and 1TB of local storage on each node. These nodes have 32 logical cores each, running at 2.6GHz. Each large memory node has two 10 Gb/s network drops connected to it and bonded into a single trunk for larger bandwidth.
  • hadoop-10g: This partition has nodes compute-2-004 to compute-2-006 that have 128GB RAM and 18+TB of disk each. These nodes have 40 logical cores each, running at 2.8GHz. Each node has two 10 Gb/s network drops connected to it and bonded into a single trunk for larger bandwidth. The three nodes provide a Hadoop Distributed File System (HDFS) with 50TB usable.
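
As referenced in the interactive partition entries above, an interactive session can be requested with srun. The invocations below are a minimal sketch using standard SLURM options; the core count and shell path are assumptions, so check the cluster's usage instructions for the exact recommended command.

    # Request an interactive shell on 4 cores of one interactive-10g node
    # (up to 16 cores per node are allowed on this partition)
    srun -p interactive-10g -N 1 -n 4 --pty /bin/bash

    # The same pattern works for the IB-backed interactive partition
    # (requires RCC approval, see below)
    srun -p interactive-ib -N 1 -n 4 --pty /bin/bash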

To check the status of nodes in a partition, use sinfo as shown below.
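
The commands below are a sketch; the exact columns and node states reported will vary with the SLURM version and the current load.

    # Summary of one partition: time limit, node count, and node states
    sinfo -p ser-par-10g

    # Long, node-oriented listing for the GPU partition
    sinfo -N -l -p par-gpu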

Every partition on Discovery Cluster has a wall clock limit of 24 hours, except for the “largemem-10g” and “hadoop-10g” partitions, which have no time limit. You will need to split your long-running jobs into smaller ones that run for no more than 24 hours at a time. For interactive queues, after 24 hours you will be logged out of the interactive node assigned to you and will have to resubmit a request for an interactive node and log in again. Remember to save your work when using interactive queues before 24 hours elapse from each login.
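
One common way to split a long run into sub-24-hour pieces is to chain dependent batch jobs, so that each step starts only after the previous one completes successfully. The sketch below uses standard sbatch options; the script names step1.sh and step2.sh are hypothetical placeholders, and each script is assumed to checkpoint its state so the next step can resume.

    # Submit the first (<= 24-hour) step and capture its job ID
    jobid=$(sbatch --parsable step1.sh)

    # Start the second step only if the first finishes without error
    sbatch --dependency=afterok:$jobid step2.sh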

There are currently twelve partitions for general users on Discovery Cluster. Ten are open to all users, and two are open to users approved by the RCC (Research Computing Advisory Committee) and ITS – Research Computing.

Partitions open to all users are “interactive-10g”, “ht-10g”, “ser-par-10g”, “ser-par-10g-2”, “ser-par-10g-3”, “ser-par-10g-4”, “par-gpu”, “par-gpu-2”, “largemem-10g” and “hadoop-10g”. The last two, “largemem-10g” and “hadoop-10g”, have a dual bonded 10 Gb/s connection on each node for larger bandwidth and throughput.

Partitions open to users approved by the RCC are “interactive-ib” and “parallel-ib”. The former uses node group “compute-1-[064-067]” and the latter “compute-1-[064-127]”, both having the 10 Gb/s TCP/IP backplane and the FDR 56 Gb/s IB backplane with RDMA.

Further details of the twelve queues, such as time limits, node lists and per-node resources, can be queried directly from SLURM, as shown below.
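
The commands below are a sketch using standard SLURM commands; the output fields vary by SLURM version.

    # Full configuration of a single partition
    scontrol show partition largemem-10g

    # One-line summary per partition: name, time limit, node count,
    # CPUs per node and memory per node
    sinfo -o "%P %l %D %c %m"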

For instructions on using these queues, see the usage instructions page. Note that “ser-par-10g” is the default partition used when the “-p” option in #SBATCH is not given.
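
The header below is a minimal sketch of a batch script that selects a partition explicitly; the job name, task count and program name are hypothetical placeholders.

    #!/bin/bash
    #SBATCH -J myjob                # hypothetical job name
    #SBATCH -p ser-par-10g-2        # omit this line to run on the default "ser-par-10g"
    #SBATCH -N 1                    # one node
    #SBATCH -n 20                   # 20 tasks
    #SBATCH --time=24:00:00         # stay within the 24-hour wall clock limit
    ./my_program                    # hypothetical executable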