Overview of Discovery Cluster

[Image: Discovery Cluster schematic overview]

Each of the administrative and login nodes has dual Intel E5-2670 CPUs @ 2.60 GHz and 256 GB of RAM. Each of the compute nodes has dual Intel E5-2650 CPUs @ 2.00 GHz and 128 GB of RAM, giving 16 physical and 32 logical compute cores per node. Additional, faster compute nodes (not shown in the image above) with dual Intel E5-2680 v2 CPUs @ 2.80 GHz and 64 or 128 GB of RAM are available on the 10 Gb/s backplane. These have larger L1, L2 and L3 caches than the E5-2670 CPUs for better performance, with 20 physical and 40 logical cores per node. There is also a 50 TB Hadoop cluster available as part of the Discovery Cluster, as well as several large-memory nodes for very large memory simulations on the 10 Gb/s backplane.
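
A quick way to confirm which node type a job landed on is to check the logical core count the operating system reports. The short C sketch below is an illustration only, not part of the cluster software; with Hyper-Threading enabled it should print 32 on the E5-2650 compute nodes and 40 on the E5-2680 v2 nodes.

```c
#include <stdio.h>
#include <unistd.h>

/* Print the number of logical processors the OS reports on this node.
 * On the E5-2650 compute nodes this should be 32 (16 physical cores x 2
 * hardware threads); on the E5-2680 v2 nodes it should be 40. */
int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    if (online < 0) {
        perror("sysconf");
        return 1;
    }
    printf("Logical processors online: %ld\n", online);
    return 0;
}
```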

In addition, compute nodes compute-1-064 to compute-1-127 have an FDR 56 Gb/s InfiniBand (IB) backplane with Remote Direct Memory Access (RDMA) enabled for RDMA-enabled applications using IB-VERBS. The IB network controller is the Mellanox Technologies MT27500 Family [ConnectX-3]. The Mellanox implementation of the IB-VERBS API is available here. In general, to use the RDMA-enabled backplane you must have an RDMA-enabled application such as NAMD or GROMACS, or modify an existing application using the IB-VERBS API. One can also use TCP/IP on the faster 56 Gb/s IB backplane, as the InfiniBand fabric is also configured for IPoIB mode. Thus “non-RDMA-enabled” regular TCP/IP-based MPI applications that are I/O intensive may benefit from the increased 56 Gb/s speed and the relatively lower latencies of the FDR IPoIB backplane (when compared with the 10 Gb/s speed and latencies of the 10G Ethernet backplane).
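
To verify that an RDMA-capable adapter is visible from a job running on compute-1-064 to compute-1-127, a small program can enumerate devices through the IB-VERBS (libibverbs) API. The following is a minimal sketch, assuming the libibverbs development headers are installed on the node; the file name is arbitrary, and it can be compiled with `gcc check_ib.c -o check_ib -libverbs`.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* List the RDMA devices visible through libibverbs.  On the FDR IB nodes
 * this should show the Mellanox ConnectX-3 (MT27500) adapter; on the
 * 10 GbE-only nodes the list will be empty. */
int main(void)
{
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);

    if (!devices) {
        fprintf(stderr, "Failed to get IB device list\n");
        return 1;
    }

    printf("RDMA devices found: %d\n", num_devices);
    for (int i = 0; i < num_devices; ++i)
        printf("  %s\n", ibv_get_device_name(devices[i]));

    ibv_free_device_list(devices);
    return 0;
}
```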

The standard backplane on all compute nodes, including the FDR IB nodes (compute-1-064 to compute-1-127), is 10 Gb/s TCP/IP.

The login and compute nodes mount /home (1 TB), /scratch (50 TB) and /share (0.5 TB) as NFSv3 exports from an Isilon NAS storage array over the 10 Gb/s TCP/IP backplane. /home holds the user home directories; each user has a 30 GB hard quota and a 20 GB soft limit. /share is for cluster software provided through modules. /scratch is only for temporary user data during computational runs; after a run the data must be removed. Each user should use no more than 100-300 GB of /scratch and should “rsync” or “sftp” their data out when done. /scratch is not for long-term storage, so keeping data there for more than three weeks is against usage policy.
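
Because the 100-300 GB /scratch guideline is enforced by policy rather than by a filesystem quota, a job may want to check available space before writing large intermediate files. Below is a minimal C sketch, assuming the NFS mount point /scratch described above.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Report total and available space on /scratch so a job can decide whether
 * there is room for its intermediate files (policy: 100-300 GB per user). */
int main(void)
{
    struct statvfs fs;
    if (statvfs("/scratch", &fs) != 0) {
        perror("statvfs");
        return 1;
    }

    double gib = 1024.0 * 1024.0 * 1024.0;
    double total = (double)fs.f_frsize * fs.f_blocks / gib;
    double avail = (double)fs.f_frsize * fs.f_bavail / gib;

    printf("/scratch: %.1f GiB total, %.1f GiB available\n", total, avail);
    return 0;
}
```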

Each compute node has ~700 GB of local disk space that can also be used temporarily for user data during computational runs; after a run completes, the data must be removed.

GPU Nodes: The Discovery Cluster also has 32 GPU nodes (not shown in the image above). Each of the 32 nodes (compute-2-128 to compute-2-159) has an NVIDIA Tesla K20m GPU with 2,496 CUDA cores. Details about this GPU are here. The 32 GPU servers have their own queue and a processor and memory configuration similar to the regular compute nodes. These GPU servers are on the 10 Gb/s TCP/IP backplane.
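
Once a job is running on the GPU queue, it can confirm that a K20m is visible and query its properties through the CUDA runtime API (cudaGetDeviceCount and cudaGetDeviceProperties). The sketch below assumes a CUDA toolkit module is loaded; it can be built with nvcc, or with gcc linked against the CUDA runtime (-lcudart).

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Query the GPU(s) visible to this job.  On compute-2-128 to compute-2-159
 * this should report one Tesla K20m: 13 SMs x 192 cores/SM (Kepler) = 2,496
 * CUDA cores, compute capability 3.5. */
int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA-capable device visible to this job\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        printf("GPU %d: %s, %d SMs, %.1f GB memory, compute capability %d.%d\n",
               i, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```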

Full details of all the compute nodes, the additional (not shown in the image above) newer, faster compute nodes on the standard 10 Gb/s TCP/IP network backplane, the large-memory compute nodes, and the 50 TB Hadoop cluster configuration are available here.

For a summary of the Discovery Cluster configuration go here.

The Discovery Cluster is located in the Massachusetts Green High Performance Computing Center (MGHPCC) in Holyoke, MA. Northeastern University has joined with four other leading institutions – MIT, the University of Massachusetts, Boston University, and Harvard University – as well as the Commonwealth of Massachusetts, the City of Holyoke, EMC and Cisco Systems in a historic project to develop a world-class, high-performance academic research computing facility in Holyoke, MA, and a statewide collaborative computational research, education and outreach program. This center – the Massachusetts Green High Performance Computing Center (MGHPCC) – is powered by a combination of green and cost-competitive energy, making it a cost-effective and environmentally sound facility.

The MGHPCC not only serves as a colocation facility for the hardware supporting the computational research needs of the individual academic institutions, but also represents a unique opportunity to serve as:

  1. A collaborative facility for strengthening the state’s leadership in the development and application of high performance computing (HPC) in addressing major challenges facing society.
  2. A facility for advancing and showcasing both the research and practice of green computing and smart grids.
  3. A catalyst for the development of the IT industry throughout Massachusetts with economic, educational and workforce development benefits to the city of Holyoke, western Massachusetts, and beyond.

The 90,300 sq.ft. data center facility is located on an 8.6 acre plot in the downtown Holyoke canal district. The flexible design of the facility supports up to 10 MW of HPC equipment through two phases in the first 10 years, with space on-site for further expansion. The partner institutions will benefit from the use of inexpensive hydro-electric power, a low-PUE green facility design, a modern, controlled facility with high-speed connectivity, and opportunities for shared services and collaboration with other institutions. Other green features of the facility include brownfield reuse, better cooling due to higher-temperature operation, higher efficiency from 400V power use, and options to use canal water for cooling. The facility also has flexible meeting and classroom areas and serves as a hub for education and outreach activities for the partner universities, local colleges and the Holyoke community. The facility connects back to the Northeastern main campus via multiple dedicated 10 Gb/s optical fiber connections.

Further details are on the MGHPCC web site here.

Northeastern researchers can make use of the facility for their research computing needs, with various support options. The University maintains the shared HPC infrastructure (the Discovery Cluster and attached storage) in the facility for use by all researchers through fair-share job scheduling. Researchers can also leverage this shared system in grant proposals. In addition, researchers can “buy in” to the shared system and add resources for a fee. In return for their incremental investment, they receive commensurate compute cycles through a preferential queue. Where required, researchers have access to a dedicated service option, in which equipment purchased by the researcher is hosted and maintained centrally by the University for use only by the researcher and their designees.

Alternatively, researchers may purchase and maintain their HPC equipment themselves, even though it is hosted at the MGHPCC facility. The support models associated with these service options at the MGHPCC facility, along with the funding and chargeback options, can be obtained by contacting NU Research Computing here. The MGHPCC collaboration also provides the partner institutions and their researchers with new ways to work together on the complex and pressing research problems of the day, including major cross-collaborative initiatives by the consortium in cyber-security and big data. Collaborative research initiatives in the past few months include a $2.3 million NSF Major Research Instrumentation grant, a $52 million NSF Track 2 proposal, $4.46 million for big-data research from the Massachusetts Life Sciences Center, and a Massachusetts Open Cloud initiative currently under investigation. $1.2 million in seed funds has been allocated for collaborative research by consortium researchers. Northeastern University is also involved, along with its partners, in education and outreach efforts with community colleges, non-profits and entrepreneurs in the Holyoke Valley area. These collaborations are leading to joint seminars, courses and proposals in the educational space, including an electronic textbook grant from NSF as well as several educational proposals currently under review by NSF.

A virtual tour of MGHPCC is below: