Queues on Discovery Cluster

Queues on the Discovery Cluster are based on eight node groups. (Note that nodes in the ranges compute-3-* and compute-4-* and higher are owned by faculty and for now are not open for use in scavenging queues. These nodes are in queues reserved exclusively for the faculty who own them.)

  • nodes10g: These are the nodes that have the 10 Gb/s TCP/IP backplane (no IB) – "compute-0-000 to compute-0-003" and "compute-0-008 to compute-0-063". Each node has 128GB of RAM.
  • nodes10g2: These are nodes compute-0-066 to compute-0-095, which have faster CPUs (Intel Xeon E5-2680, 2.8GHz). These nodes have 40 logical cores each and 64GB of RAM. They also have the 10 Gb/s TCP/IP backplane (no IB).
  • nodesib: These are the nodes that have the 10 Gb/s TCP/IP backplane and the FDR 56 Gb/s RDMA backplane – "compute-1-064 to compute-1-127". Each node has 128GB of RAM.
  • nodes10gint: These are nodes compute-0-000, compute-0-001, compute-0-002 and compute-0-003, which users can use for interactive work. Users can request an interactive node here via LSF with 1 or more cores, up to a maximum of 16 cores on each node.
  • nodesibint: These are nodes compute-1-064, compute-1-065, compute-1-066 and compute-1-067, which users can use for interactive work. Users can request an interactive node here via LSF with 1 or more cores, up to a maximum of 16 cores on each node.
  • nodes10ght: These are nodes compute-0-004, compute-0-005, compute-0-006 and compute-0-007, which users can use for jobs that need the 10 Gb/s backplane only but benefit from Intel Hyper-Threading (HT) and Intel Turbo Boost. These nodes have 32 logical cores, as opposed to the 16 logical cores of the standard compute nodes, because HT is turned on here and turned off on those other nodes.
  • nodesgpu: These are nodes compute-2-128 to compute-2-159, each of which has an NVIDIA Tesla K20m GPU. These nodes have 32 logical cores each, and each GPU has 2496 CUDA cores. Each node has 128GB of RAM.
  • nodeslargemem: These are nodes compute-2-000 to compute-2-003, which have large memory – 384GB of RAM, a 2TB swap file and 1TB of local storage on each node. These nodes have 32 logical cores each, running at 2.6GHz. Each large memory node has two 10 Gb/s network drops connected to it, bonded into a single trunk for larger bandwidth.

To check the status of nodes, use bhosts or bhosts <node_group>, as shown below.
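Here is a minimal sketch; the node group names are the ones listed above:

    # Summary of all hosts known to LSF
    bhosts

    # Status of only the hosts in one node group, e.g. the GPU nodes
    bhosts nodesgpu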

Every queue on the Discovery Cluster has a wall clock limit of 24 hours, except for the "largemem-10g" queue, which has no time limit. You will need to partition long-running jobs into smaller ones that run no more than 24 hours at a time. On interactive queues, after 24 hours you will be logged out of the interactive node assigned to you and will have to resubmit a request for an interactive node and log in again. Remember to save your work on interactive queues before 24 hours elapse from each login.
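One common way to partition a long run is to chain stages together with LSF job dependencies, so that each stage stays under the 24-hour limit. A minimal sketch, assuming your application can checkpoint and resume; the job names and the run_stage.sh script are hypothetical:

    # Stage 1: runs for at most 24 hours and writes a checkpoint before exiting
    bsub -J stage1 -q ser-par-10g -W 24:00 ./run_stage.sh 1

    # Stage 2: dispatched only after stage1 completes successfully,
    # resuming from the checkpoint (hypothetical workflow)
    bsub -J stage2 -w "done(stage1)" -q ser-par-10g -W 24:00 ./run_stage.sh 2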

There are currently eight queues for users on the Discovery Cluster. Six queues are open to all users, and two are open only to users approved by the RCC (Research Computing Advisory Committee) and ITS Research Computing.

Queues open to all users are "interactive-10g", "ht-10g", "ser-par-10g", "ser-par-10g-2", "par-gpu" and "largemem-10g". They map to node groups as follows:

  • "interactive-10g" uses node group "nodes10gint".
  • "ht-10g" uses node group "nodes10ght".
  • "ser-par-10g" uses node group "nodes10g" – the compute nodes "compute-0-000 to compute-0-063" that have the 10 Gb/s backplane only.
  • "ser-par-10g-2" uses node group "nodes10g2" – the compute nodes "compute-0-066 to compute-0-095" that also have the 10 Gb/s backplane only.
  • "par-gpu" uses node group "nodesgpu", which also has the 10 Gb/s backplane only.
  • "largemem-10g" uses node group "nodeslargemem"; each of its nodes has a dual bonded 10 Gb/s connection for larger bandwidth and throughput.
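To request an interactive session on the "interactive-10g" queue, LSF's -Is option can be used. A minimal sketch; the core count is illustrative:

    # Request an interactive shell with 4 cores on an interactive node
    bsub -Is -q interactive-10g -n 4 /bin/bash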

Queues open to users approved by the RCC are "interactive-ib" and "parallel-ib". The former uses node group "nodesibint" and the latter "nodesib" – the compute nodes "compute-1-064 to compute-1-127" that have the 10 Gb/s and FDR 56 Gb/s IB backplanes.

Further details of the eight queues are summarized below. For instructions on using these queues, see the queue usage instructions. Note that "interactive-10g" and "ser-par-10g" are the default interactive and batch queues, respectively, if the "-q" #BSUB option is not used.
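As a sketch of selecting a queue with the "-q" option in a batch script; the script name, job name, core count and executable below are all hypothetical:

    #!/bin/bash
    #BSUB -J myjob               # job name (hypothetical)
    #BSUB -q ser-par-10g-2       # target queue; omit this line to use the default ser-par-10g
    #BSUB -n 16                  # number of cores
    #BSUB -o myjob.%J.out        # standard output file (%J expands to the job ID)
    #BSUB -e myjob.%J.err        # standard error file

    ./my_program                 # hypothetical executable

The script is submitted with bsub < myjob.lsf so that LSF reads the #BSUB directives.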

  • "interactive-10g": interactive jobs only, no batch jobs.
  • "interactive-ib": interactive jobs only, no batch jobs.
  • "ser-par-10g": batch jobs only, no interactive jobs.
  • "ser-par-10g-2": both interactive and batch jobs.
  • "parallel-ib": batch jobs only, no interactive jobs.
  • "ht-10g": both interactive and batch jobs.
  • "par-gpu": both interactive and batch jobs.
  • "largemem-10g": both interactive and batch jobs.
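The full configuration of any queue (run limit, host groups, per-user limits) can also be queried directly from a login node with LSF's bqueues command:

    # One-line summary of every queue
    bqueues

    # Full details for a single queue
    bqueues -l ser-par-10g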