What is High Performance Computing?
High Performance Computing (HPC) is the practice of harnessing the power of a supercomputer to achieve a much higher level of performance than a single computer can deliver.
HPC is used by organizations all over the world that require exceedingly high-speed computations for scientific and academic research, engineering, manufacturing, and business.
According to ETP4HPC, High Performance Computing (HPC) plays a pivotal role in stimulating economic growth: it is a pervasive tool that allows industry and academia to develop world-class products, services, and inventions, maintaining and reinforcing Europe's position in the competitive worldwide arena. HPC is also recognised as crucial in addressing grand societal challenges.
Previously, HPC was characterized by a narrow set of applications used by just a few industries, such as scientific research and academia. These organizations ran predominantly Intel-based, on-premises systems. Recently, new technologies such as cloud, edge, data analytics, and machine learning have brought huge change to HPC.
First let's look at the AI market. HPC is teaching so much to the relatively nascent machine learning and deep learning industries. As organizations grapple with the enormous computations that fuel AI projects, they can replicate HPC's finely honed methodologies to achieve results quickly and cost-effectively.
With the advent of cloud, HPC no longer has to be on-premises. Increasingly, organizations are turning to hybrid cloud HPC solutions, which allow them to burst into the cloud for additional compute power when on-premises HPC resources reach capacity. This offers the best of both worlds.
And finally, modern day HPC workloads can also take place at the edge, on distributed servers, so organizations can truly manage clusters from the edge to the core to the cloud.
A supercomputer is much more powerful than a standard computer; essentially, it comprises clusters of tightly connected computers that work together. Supercomputer performance is measured in floating-point operations per second (FLOPS). Using a technique referred to as "parallel processing," a supercomputer's thousands of compute nodes complete multiple tasks at the same time.
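The idea of dividing one large computation among many workers can be sketched in a few lines of Python; here, processes on a single machine stand in for a cluster's compute nodes:

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Compute one slice of the overall sum -- the 'task' each worker handles."""
    start, end = bounds
    return sum(x * x for x in range(start, end))

def parallel_sum_of_squares(n, workers=4):
    """Split range(n) into equal chunks and process them in parallel,
    then combine the partial results -- the essence of parallel processing."""
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # Same answer as the serial computation, but the work is divided
    # among processes -- analogous to nodes in an HPC cluster.
    print(parallel_sum_of_squares(1_000_000))
```

On a real cluster the same decompose-compute-combine pattern runs across physically separate nodes, typically coordinated with MPI rather than a single machine's process pool.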
The term “supercomputer” is applied to the world’s fastest high performance systems. The world's fastest 500 supercomputers run Linux-based operating systems.
Technology develops at such a pace that vendors are in a constant battle to build the next most powerful supercomputer. According to Wikipedia, since June 2020, Japan's Fugaku has been the world's most powerful supercomputer, reaching 415.53 petaFLOPS on the LINPACK benchmarks.
HPC applies a supercomputer to a computational problem that is too large or complex for a standard computer. The HPC system is able to cope with enormous compute packages because it runs a network of nodes, each of which contains one or more processing chips and its own memory.
HPC empowers organizations to quickly and efficiently carry out large-scale computational projects, giving scientists, developers, and researchers rapid access to their results.
According to Network World, “as costs drop and use cases multiply, HPC is attracting new adopters of all types and sizes. Expanding options include supercomputer-based HPC systems, cluster-based HPC, and cloud HPC services. In today's data-driven world, HPC is emerging as the go-to platform for enterprises looking to gain deep insights into areas as diverse as genomics, computational chemistry, financial risk modeling and seismic imaging.”
Initially embraced by research scientists who needed to perform complex mathematical calculations, HPC is now gaining the attention of a much broader range of enterprises spanning an array of fields. Today, every industry vertical has a need for HPC. The early adopters in the research and scientific community, who continue to push the boundaries of HPC, are now joined by higher education establishments carrying out research analysis, healthcare organizations, pharmaceutical companies in their quest for new medicines and vaccines, government agencies, and the broader commercial market, including automotive, manufacturing, high tech, retail, and more.
HPC is quickly becoming a strategic necessity for all industries that want to gain a competitive advantage in their markets, or at least keep pace with their industry peers in order to survive.
Running a supercomputer is a complicated process. A supercomputer comprises clusters of tightly connected computers that work together, containing thousands of processors, hundreds of gigabytes of RAM, hard drives, and much more.
A cluster manager - also known as clustered infrastructure management - automates the process of configuring and administering each cluster. In this way, a cluster manager will radically simplify the process of setting up and deploying a supercomputer, and then add additional value when it comes time to health-check, monitor, or upgrade each area of the cluster.
An HPC cluster won’t work without software. The two most popular operating system choices for HPC are Linux and Windows. Linux is a family of open-source Unix-like operating systems based on the Linux kernel. Historically, Linux has dominated the HPC industry’s installations, largely due to HPC’s legacy in supercomputing, large-scale machines, and Unix.
When it comes time for you to choose an operating system, make sure that you consider the applications that you plan to run on your HPC cluster.
HPC is a competitive market. Major vendors such as Dell, HPE Cray, and Lenovo are competing for the same business and strive to offer ever more powerful technology. As a result, heterogeneity in HPC has never been greater, and, by default, HPC environments are often a hybrid mix of hardware and software. The hybrid nature of these platforms introduces a new level of complexity. For sysadmins, monitoring, health-checking, managing, and upgrading a supercomputer is hard enough; doing it with multiple hardware and software vendors in play, and at this scale, is incredibly daunting.
While an HPC computer is more complex than a simple desktop computer, under the hood, you’ve got all the same essential elements. The complexity comes with the scale and size of the HPC environment. Managing such an enormous amount of compute power and the associated elements is not trivial, which is where clustered infrastructure management comes into play.
Clustered infrastructure management software automates the process of building and managing modern high-performance Linux clusters, eliminating complexity and enabling flexibility.
Clustered infrastructure management software will enable an organization to manage heterogeneous high-performance Linux clusters spanning from core to edge to cloud.
Traditionally, clustered infrastructure management was seen as a luxury for organizations that ran only the biggest of the world’s supercomputers. This is changing.
New, emerging technologies such as cloud, edge, data analytics, and machine learning have given rise to an enormous change in HPC. As a result, HPC spans more industries than ever before, and organizations are striving to achieve results quicker, more cost-effectively, and more efficiently.
Linux clusters are powering this new generation of business innovation. Powerful new hardware, applications, and software leave system administrators struggling with complexity while endeavoring to provide a system that is both flexible and reliable for end users.
Clustered infrastructure management software eliminates complexity and enables flexibility by allowing users to deploy complete clusters over bare metal and manage them reliably from edge to core to cloud.
Clustered infrastructure management software takes the pain out of deploying and managing a supercomputing environment, empowering organizations to get the most out of their HPC platform and achieve their business goals more quickly.
HPC as we know it is changing forever. It’s no longer about having a compute-rich system running modeling & simulation applications on bare metal using a workload manager in your data center. It’s about extending that system to the cloud and to the edge. It’s about running on VMs and in containers as well as on premises. It’s about running machine learning and analytics applications. It’s about using processors and accelerators from AMD, Arm, NVIDIA, Graphcore, and others. It’s about bringing the potential of high-performance computing to EVERY industry.
This new era of high-performance computing brings with it a level of complexity that is crippling. Combining different types of applications, processors, vendor hardware, and deployment models into one system requires a level of technical knowledge and expertise that few organizations have. This is where clustered infrastructure management comes in.
Clustered infrastructure management automates the process of building and managing high-performance Linux clusters, accelerating time to value, reducing complexity, and increasing flexibility.
When choosing a clustered infrastructure management solution, make sure that it can expand its capabilities from traditional HPC to machine learning, the cloud, the edge, and container environments.
Most importantly, make sure your clustered infrastructure management software makes it easy for you to build and manage your HPC environment, so that you can focus your time and energy on things that add value to your business.
Your clustered infrastructure management software should:
Even in the simplest HPC environment, keeping track of all the activities involved in managing a cluster manually is not only complex but time-consuming. Enterprise clusters host multiple jobs and multiple tenants simultaneously, with mission critical applications competing for resources. Managing these clusters in a way that keeps them up and running efficiently is critical to business success.
When a cluster is made up of hundreds or even thousands of nodes, managing it can be a real challenge. It can be difficult to monitor all of the hardware and software components of a cluster, while optimizing the use of resources for multiple workloads. But failing to do so can leave end users waiting for their jobs to complete.
A cluster manager provides a way to monitor and manage the entire cluster, working with workload managers to allow the best use of cluster resources as defined by the organization’s policy. Bright's cluster manager can dynamically provision clusters based on demand, allowing administrators to optimize node utilization and minimize the number of nodes that sit idle waiting for a job to process.
Cluster management software provides an automated way to track and manage the cluster as a whole, improve overall application performance, optimize resource utilization, and identify problems so they can be reviewed and acted upon quickly. The ability to deploy, provision, and manage large clusters from bare metal is only the beginning of what cluster manager software can do. With performance management, system admins can:
A major challenge for sysadmins is having to manually detect failures, degraded performance, and power inefficiencies, and identify their root causes. A cluster manager can proactively monitor the cluster’s health and report anomalies as soon as it spots them. In many cases, the high availability feature of cluster management software can provide automatic failover and keep the cluster running even after a failure has taken out a critical node. A good cluster manager can often detect and initiate failover without human intervention, notify admins of the problem, help to identify the source of the problem, and provide the tools needed to get the ailing server back online.
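As a rough illustration of the kind of policy involved, a cluster manager might classify a node's recent health-check history and decide when to initiate failover. The thresholds and names below are invented for illustration, not any vendor's actual logic:

```python
# Illustrative sketch (not a real product's API): classify a node's
# health-check history and decide whether automatic failover is warranted.

FAILOVER_MISSES = 3  # consecutive failed checks before acting (assumed policy)

def evaluate_node(history):
    """history: list of booleans, True = health check passed, newest last.
    Returns 'healthy', 'degraded', or 'failover'."""
    if not history or history[-1]:
        return "healthy"
    # Count the run of consecutive failures at the end of the history.
    recent_failures = 0
    for ok in reversed(history):
        if ok:
            break
        recent_failures += 1
    return "failover" if recent_failures >= FAILOVER_MISSES else "degraded"
```

A single missed check yields "degraded" (notify the admins); only a sustained run of failures triggers failover, which avoids acting on a transient network blip.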
The monitoring process, when done manually, is extremely time-consuming and potentially error-prone. The lack of visibility into clusters can make it difficult to determine:
Using an advanced cluster manager enables administrators to:
From provisioning to management, monitoring, and maintenance, a cluster manager maximizes end user and system admin productivity and time management in ways that bring top and bottom line benefits to the enterprise.
Cluster-as-a-Service (or CaaS for short) allows end-users to quickly and easily spin up large numbers of sophisticated, isolated, and secure cluster infrastructures within their private cloud environment. These cluster infrastructures have their own single-pane-of-glass management infrastructure and can easily be managed by end-users rather than by cluster administrators alone.
CaaS deploys a ready-to-use HPC cluster, preconfigured with all of the required HPC libraries, compilers, workload managers, and other tools that users need to quickly start running HPC workloads.
The interest and use of containers in HPC are exploding, and for good reason. Containers represent a new level of abstraction that elegantly solves the problem of how to get software to run reliably across multiple computing environments.
Unlike virtual machines (VMs), containers virtualize at the operating system level and allow multiple containers to run on the same OS kernel. This makes containers smaller, faster to instantiate, and lighter weight than VMs.
While containers offer a number of advantages over virtualization, deploying them reliably can bring its own challenges. Provisioning and configuring the underlying servers, deploying and managing container orchestration frameworks like Kubernetes, and monitoring all of that infrastructure can create real headaches for IT administrators.
Your clustered infrastructure management system can help. It will make it easy for you to deploy container technology, whether you choose Docker or Singularity runtime engines, or Kubernetes container orchestration.
Using a clustered infrastructure management solution to manage your container infrastructure and frameworks eliminates hassles and complexity and allows you to fire up containers alongside other applications within the cluster.
A clustered infrastructure management solution will make it easy to allocate some or all of your server infrastructure to running containers and change that allocation dynamically as demand for resources changes. The result is an efficient clustered infrastructure with high utilization rates, regardless of what kinds of jobs your user community needs.
A hybrid cloud is a type of cloud computing that combines on-premises infrastructure (also known as a private cloud) with a public cloud. Hybrid clouds are set up to enable data and applications to move securely between the two environments.
When it comes to HPC and cloud, there are a number of business drivers that might encourage an organization to choose a hybrid cloud model, such as meeting regulatory and data requirements, maximizing an existing on-premises technology investment, addressing low-latency requirements, or accessing additional cloud resources when on-premises compute power has been exhausted.
Interestingly, hybrid clouds are also evolving to include edge workloads. Edge computing brings the computing power of the cloud to IoT devices, closer to where the data resides. By moving workloads to the edge, devices spend less time communicating with the cloud and therefore reduce latency, which means they are often able to operate reliably offline for extended periods. We discuss edge computing in more detail at the bottom of the page.
To take advantage of the flexibility that cloud-based HPC offers, many enterprises deploy a mix of public and private clouds in a hybrid model. A fundamental step in realizing the full potential of a hybrid cloud is cloud bursting, yet too few organizations embarking on the HPC-and-analytics path appreciate its importance.
Cloud bursting is all about the dynamic deployment of applications that normally run on a private cloud into a public cloud to meet expanding capacity requirements and handle peak demands when private cloud resources are insufficient. Cloud bursting can make these private clouds more cost-efficient by eliminating the need to overbuild physical infrastructure to ensure enough capacity to meet fluctuating peaks in demand. Private clouds can be rightsized in terms of compute and storage to accommodate the ongoing demands because the peaks can be handled by a public cloud and a pay-per-use model.
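A minimal sketch of a bursting decision, assuming a simple policy in which each queued job needs a fixed number of nodes and the cloud footprint is capped by a budget-driven ceiling (all names and the policy itself are illustrative):

```python
def nodes_to_burst(queued_jobs, nodes_per_job, on_prem_free, max_cloud_nodes):
    """Decide how many public-cloud nodes to provision so queued work can
    start: the shortfall beyond free on-premises capacity, capped by budget."""
    shortfall = queued_jobs * nodes_per_job - on_prem_free
    return max(0, min(shortfall, max_cloud_nodes))

# Ten queued jobs needing two nodes each, five free on-prem nodes:
# burst 15 nodes into the cloud (well under the 100-node cap).
print(nodes_to_burst(10, 2, 5, 100))
```

Real bursting policies also weigh data-transfer costs, instance pricing, and job priorities, but the core idea is the same: the private cloud is sized for steady-state demand, and peaks spill over on a pay-per-use basis.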
There are many scenarios where businesses can benefit from cloud bursting. Sectors that experience seasonal spikes, for example, place an extra burden on private clouds. Enterprise data centers may have geographic needs, where one location experiences heavy loads and must meet application-specific performance requirements. Software development and analytics are two of the fastest-growing drivers of demand for cloud bursting: DevOps teams spin up numerous virtual machines for testing purposes that are only needed for a short time.
Overall, the availability of a public cloud offers the chance to reduce the capital cost of owning and managing excess compute capacity and storage for all kinds of workloads. By combining it with on-premises cloud resources and using cloud bursting to manage it, the public cloud serves as on-demand overflow capacity and eliminates the need for costly over-provisioning to meet temporary demand.
Hybrid clouds are particularly useful in certain sectors. Life sciences workflows, for example, generate a great deal of data on-premises for things like genomic sequencing but rely on the ability to analyze and compute the data in the cloud. This is a prime example of a temporary-use scenario. Cloud bursting is also integral to the financial sector, which must develop predictive models based on stock market data for market risk analysis.
The ability to effectively and seamlessly manage demand and potentially bring thousands of additional cores to bear is the only way that many sectors can make this type of data analysis effective in terms of both cost and time. When demand spikes, these companies can do the cost/schedule math to figure out how much additional processing power is needed and just rent it from Amazon Web Services or Microsoft Azure.
Seamlessly managing and monitoring cloud bursting of HPC workloads in a hybrid cloud requires a sophisticated cluster management solution that can integrate with a wide variety of workload managers and eliminate the steep learning curve across cloud platforms. By significantly reducing the complexity inherent to cloud bursting, such a solution brings a great deal of agility, responsiveness, and simplification, saving money and time while opening up cloud compute vistas for enterprises across every sector.
To extract the most value from your HPC cluster, you need to ensure that system resources are being properly utilized. HPC system users are notorious for over-requesting resources for their jobs, resulting in idle or underutilized resources that could otherwise be doing work for other jobs. One reason is users hoarding resources to ensure they have what they need; another common reason is that users simply don't know what resources their jobs will need to complete in a specified time. To ensure that precious and expensive cluster resources aren't being squandered, administrators need actionable details on how those resources are being used: which jobs are using which resources, which jobs aren't using the resources they've been allocated, which users are repeatedly hoarding resources unnecessarily, and more.
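The kind of analysis involved can be sketched in a few lines of Python. The job-record fields below are assumptions made for illustration, not the schema of any particular accounting tool:

```python
def utilization_report(jobs):
    """jobs: iterable of dicts with 'user', 'cores_requested', 'cores_used',
    and 'hours'. Returns per-user requested vs. used core-hours and the
    utilization ratio, revealing who habitually over-requests resources."""
    report = {}
    for job in jobs:
        user = report.setdefault(job["user"],
                                 {"requested": 0.0, "used": 0.0})
        user["requested"] += job["cores_requested"] * job["hours"]
        user["used"] += job["cores_used"] * job["hours"]
    for user in report.values():
        user["utilization"] = (user["used"] / user["requested"]
                               if user["requested"] else 0.0)
    return report

# A user who requests 64 cores but keeps only 16 busy shows up immediately.
jobs = [
    {"user": "ann", "cores_requested": 64, "cores_used": 16, "hours": 2},
    {"user": "ann", "cores_requested": 32, "cores_used": 32, "hours": 1},
]
print(utilization_report(jobs)["ann"]["utilization"])  # 0.4
```

Workload accounting tools gather exactly this sort of requested-versus-used data automatically, per job, per user, and per project.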
Workload Accounting and Reporting (aka WAR) gives cluster administrators the insights they need to answer these types of questions.
Administrators can use these insights to enlighten users as to the resource requirements of their jobs, facilitate chargeback reports if desired, give management visibility into how the system is being used, and provide the rationale for investing in additional resources when warranted. Even if users are not actually charged for the resources they use, this visibility shows how much the jobs being run cost, and stakeholders can then use this information to make value judgments.
An HPC cluster consists of hundreds or thousands of compute servers that are networked together. Each server is called a node. The nodes in each cluster work in parallel with each other, boosting processing speed to deliver high-performance computing.
The provisioning of software images to nodes is a key task of any clustered infrastructure management platform. Powerful and flexible node provisioning and software image management are essential to cluster installation and management, especially for larger and more complex clusters.
Sophisticated node provisioning and image management allows you to do the following:
A clustered infrastructure management platform will allow you to monitor, visualize, and analyze a comprehensive set of hardware, software, and job metrics in various ways. Virtually all software and hardware metrics available to the Linux kernel, as well as the hardware metrics exposed by hardware management interfaces such as IPMI, can be collected.
The metrics available by default on a cluster can be categorized into three main categories:
For each of the above categories, the following subcategories are available:
Your clustered infrastructure management solution should include powerful GPU management and monitoring capabilities that leverage functionality in NVIDIA® Tesla™ GPUs to take maximum control of the GPUs and gain insight into their status and activity over time.
It should sample and monitor metrics from supported GPUs and GPU computing systems, such as the Kepler-architecture NVIDIA Tesla K80 dual-GPU accelerator, as well as collections of GPU accelerators in a single chassis.
Examples of supported metrics should include:
The frequency of metric sampling is fully configurable, as is the consolidation of these metrics over time. Metrics will be stored in your clustered infrastructure management system’s monitoring database.
You should also expect that alerts and actions can be triggered automatically when GPU metric thresholds are exceeded. Such rules are completely configurable to suit your requirements, and any built-in cluster management command, Linux command, or shell script can be used as an action. For example, if you would like to automatically receive an email and shut down a GPU node when its GPU temperature exceeds a set value, this can easily be configured.
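The rule evaluation behind such a configuration might look like the following sketch. The metric names, thresholds, and action names are illustrative, not a real product's API:

```python
# Illustrative threshold rules: which actions to fire when a sampled
# GPU metric exceeds its configured limit (all values are assumptions).
RULES = {
    "gpu_temperature_c": {"threshold": 85,
                          "actions": ["email_admin", "shutdown_node"]},
    "gpu_memory_used_pct": {"threshold": 95,
                            "actions": ["email_admin"]},
}

def actions_for_sample(metric, value, rules=RULES):
    """Return the list of configured actions triggered by one metric sample,
    or an empty list if the sample is within its threshold."""
    rule = rules.get(metric)
    if rule and value > rule["threshold"]:
        return rule["actions"]
    return []
```

A sample of 90 °C on `gpu_temperature_c` would return both actions, matching the email-and-shutdown example above; in a real system each action name would map to a cluster management command, Linux command, or shell script.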
Cluster Health Management can also include health checks for GPU cards and GPU Computing Systems in GPU clusters. Any of the supported GPU metrics can be used in regular and prejob health checks.
The promise of machine learning is pulling every organization in every industry into the realm of HPC. Many organizations have little or no experience with these systems, and even the most experienced high-performance computing practitioners will tell you that building and managing high-performance Linux clusters is no easy task. With hundreds or thousands of hardware and software elements that must work in unison, spanning compute, networking and storage, the skills, knowledge and experience required to do this is often more than an organization can cope with.
Your clustered infrastructure management platform will offer an integrated solution for building and managing machine learning clusters that reduces complexity, accelerates time to value, and provides enormous flexibility.
You should also be able to access a pre-tested catalog of popular machine learning frameworks and libraries, as well as integration with Jupyter Notebook, to ensure that data scientists can be as productive as possible and not waste time managing their work environment.
Automated cluster provisioning and setup:
Automated problem detection and isolation:
Automated change management:
Edge computing allows a business to place infrastructure in remote locations that lack IT staff, delivering new applications that take advantage of locally generated data to improve operations, services, experiences, and products.
A common use case for edge computing arises from Internet of Things (IoT) applications. In these situations, sensors produce large amounts of data at remote locations that need to be processed in real time and acted on. Typically, these applications can't tolerate the network latency involved in sending data back to the cloud or other location for processing.
Another use case for edge computing occurs in applications for HPC, where computing clusters have resources located in distributed geographical regions near data sources that may span a city, a country or the globe.
In both cases, organizations find themselves needing to deploy, manage, and monitor these distributed computing resources efficiently and effectively, often without the benefit of local IT staff. The solution is to deploy and centrally manage these resources as a single clustered infrastructure.
Cluster management solutions allow organizations to deploy and centrally manage computing resources in distributed locations as a single clustered infrastructure, from a single interface. The distributed computing nodes deployed and managed by the cluster manager can be imaged to support any workload, re-imaged on the fly to support different workloads when desired, and monitored to ensure that you always know precisely what’s going on. And when you need to add more computing capacity at any location, bringing additional nodes online is quick and easy.
A cluster management solution provides the means to deploy edge servers securely over a network, or from local media. No onsite personnel are needed for network deployment, and for local media deployments, no special expertise is required. What's more, the cluster manager's deployment methodology allows for smooth installations everywhere, even to sites that are only reachable over low bandwidth connections.