
Introduction to HPC Clusters

A High-Performance Computing (HPC) cluster is a network of interconnected servers designed to run complex scientific simulations. When many such servers are connected in specific configurations through a High-Speed Network (HSN), which provides high-bandwidth communication with minimal delay, the resulting system is known as a supercomputer.

These servers, referred to as compute nodes, run demanding scientific applications on advanced CPUs and GPUs and rely on the HSN to scale efficiently across many nodes. HPC workloads commonly come from disciplines such as Computational Fluid Dynamics, Earth Sciences, Climate Modelling, Computational Chemistry, Bioinformatics, and, more recently, Artificial Intelligence.
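
Most such applications scale across nodes through a message-passing library such as MPI. The following is a minimal sketch, with all program and command names purely illustrative, of how a distributed program sees the cluster: each rank typically runs on a core of a compute node, and collective operations travel over the HSN.

    /* Minimal sketch of a distributed MPI program (all names illustrative).
     * Each process, or "rank", usually runs on a core of a compute node, and
     * the MPI library carries communication between ranks over the HSN.
     * Example build and launch commands (site-specific):
     *     mpicc hello_hpc.c -o hello_hpc
     *     mpirun -n 4 ./hello_hpc
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes  */

        char node[MPI_MAX_PROCESSOR_NAME];
        int len;
        MPI_Get_processor_name(node, &len);    /* name of the hosting node   */

        /* A collective sum: every rank contributes a value, and the result
         * is combined across the interconnect and returned to all ranks.   */
        int local = rank, total = 0;
        MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("Rank %d of %d on node %s (sum of ranks = %d)\n",
               rank, size, node, total);

        MPI_Finalize();
        return 0;
    }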

To optimize performance, each compute node runs a lightweight operating system that includes only the processes critical to its function. This streamlined approach makes compute nodes less flexible than the general-purpose systems found in personal computers. HPC clusters and supercomputers combine large numbers of high-performance, often specialized compute units with substantial high-speed memory, and they require dedicated power and cooling infrastructure to support their operation.

Users interact with the system via specialized login nodes, which are designed for tasks like file management, job submissions, and application compilation. Given that these login nodes are shared resources, users should refrain from executing resource-intensive processes on them. System administrators may issue warnings and enforce access restrictions for those who misuse these resources repeatedly.

File input/output (I/O) operations are integral to application performance. On a typical personal computer, processes perform I/O against a single filesystem mount point backed by one storage device, so they effectively take turns. A parallel filesystem instead distributes large datasets across many storage devices, allowing concurrent access and significantly higher performance.

This type of filesystem is crucial in HPC environments. It is accessible from both login and compute nodes, can scale to several petabytes (PB) of total storage, and adheres to the POSIX standard for file I/O, so most applications can read and write data without modification. Because data is spread across many disks, large numbers of I/O clients can operate concurrently at comparable speeds. Parallel filesystems also support parallel I/O interfaces and libraries such as MPI-IO, HDF5, NetCDF, and ADIOS, which improve bandwidth and reduce latency during data transfers.
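
As one illustration of parallel I/O, the sketch below uses MPI-IO to have every rank write its own non-overlapping block of a single shared file on the parallel filesystem; the file name and sizes are purely illustrative.

    /* Sketch of parallel I/O with MPI-IO (file name and sizes illustrative):
     * all ranks open one shared file and write disjoint regions concurrently. */
    #include <mpi.h>

    #define COUNT 1024                      /* integers written per rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int data[COUNT];                    /* this rank's portion of the dataset */
        for (int i = 0; i < COUNT; i++)
            data[i] = rank;

        /* Every rank opens the same file on the parallel filesystem. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at its own offset; the collective call lets the
         * MPI library coordinate and optimize access to the storage targets. */
        MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
        MPI_File_write_at_all(fh, offset, data, COUNT, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }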

The HSN connects compute nodes to one another and to the parallel filesystem, enabling efficient data transfer. How the nodes are interconnected, that is, the network topology of the HSN, plays a significant role in determining how well an HPC cluster scales.

A job scheduler is employed to help users specify the computational resources required for their workloads, including the expected duration and allocation preferences. Users create job scripts, save them in the parallel filesystem, and submit them to the scheduler from the login node. The scheduler manages the execution of jobs based on user requests, and once a job is completed, the results are stored in the designated location within the parallel filesystem for user access.
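
The scheduler in use varies between sites (Slurm, PBS, and LSF are common). Purely as an illustration, assuming a Slurm-style scheduler, a job script might look like the following; every resource value, file name, and program name here is hypothetical.

    #!/bin/bash
    #SBATCH --job-name=my_simulation        # hypothetical job name
    #SBATCH --nodes=2                       # number of compute nodes requested
    #SBATCH --ntasks-per-node=32            # processes (MPI ranks) per node
    #SBATCH --time=01:00:00                 # expected wall-clock duration
    #SBATCH --output=result_%j.out          # output file on the parallel filesystem

    # Launch the (hypothetical) application across the allocated nodes.
    srun ./my_simulation

In such a setup, the script is saved on the parallel filesystem and submitted from a login node (with the sbatch command under Slurm); the scheduler queues the job, runs it on compute nodes once the requested resources become available, and writes the output file back to the parallel filesystem for the user to retrieve.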

Together, these components form an efficient HPC cluster or supercomputer.

