Homegrown Cluster Management…Just because you can, doesn’t mean you should Pt. 1


By Bill Wagner | March 19, 2019 | Bright Cluster Manager



The historical thinking behind building and maintaining your own HPC cluster management solution using open source tools goes something like this: “We have smart people who can build this. We have limited capital budget. Commercial cluster management software costs money, but open source tools like xCAT, Rocks, OpenHPC and others are free. If we build a solution ourselves using open source tools, we can use the savings to buy more hardware.”

Let’s tease this thought process apart.

First, without even knowing your organization or your circumstances, I’ll concede the first three points: 1) You undoubtedly have wickedly smart people who are capable of building some form of cluster management system; 2) You have limited capital budget; 3) Commercial cluster management software costs money. With 75% of the argument already won without uttering a single word, the decision to build your own cluster management solution might appear to be the right one.

Now let’s look at the fourth point … “open source tools are free.” While it’s true that open source tools like xCAT, Rocks, OpenHPC, etc. have no license fees associated with them, I’m guessing that your team is spending time developing automation scripts to make those tools do what you want them to do. I’m also guessing that you are using other open source tools like Nagios or Ganglia to monitor your cluster, which means that your team is developing additional scripts to monitor the cluster. And to make life less hellish for your cluster administrator, I’m sure that your team is developing some level of integration between all of the disparate tools that you’ve pulled together to create this “free” cluster management solution. Since developing scripts and integrations is never a one-time ordeal, the team is also spending time maintaining them.

Despite the fact that you’ve got some wickedly smart people working on all of this, I’m guessing that it has taken (calendar) time to test, troubleshoot and ultimately get everything working properly.  I guess you could consider this approach “free” if you don’t consider the salaries of the people doing all of this work to be a real cost.  Hopefully, you’re paying those people REALLY well, because it sounds like you’re dependent on them to keep your cluster humming. 

So now that you’ve got it all working correctly, you’re home free! What could go wrong? As anyone familiar with clusters knows, things DO, in fact, go wrong. Hardware fails, software needs updating, networks and storage encounter problems, performance mysteriously degrades, new or additional hardware needs to be added and monitored; the list goes on. Hopefully, the “free” cluster management solution you’ve created will allow your administrator to quickly identify and address these issues. But again, I guess you can consider this do-it-yourself approach to be more cost-effective than using commercial cluster management software if you don’t consider the extra time and effort required by the administrator, or additional downtime to users, to be a real cost to the business.

As I’m sure you’ve guessed by now, the decision to build vs. buy your HPC cluster management solution is no longer a tactical one, but a strategic one. Stay tuned for next week’s blog post, where I offer my insights into some key variables that make this decision strategic and explain why buying your HPC cluster management solution can make all the difference.