Apache Hadoop has been around for longer than many people realize. It’s now over a decade old, and has come a very long way since 2002. GigaOm’s Derrick Harris did a good job covering Hadoop’s history in this series of posts titled The history of Hadoop: From 4 nodes to the future of data.
There is now a thriving community built up to support and extend Hadoop, and it has spawned a bevy of startups seeking to address various aspects of the Hadoop ecosystem for profit. But despite all the progress, there’s still an “elephant in the room” (sorry, I couldn’t resist) that nobody talks much about: cluster management.
Now, before you jump all over me for not mentioning Ambari and other Hadoop management solutions, let me clarify what I mean. You say cluster, and I say…ummm, well I say cluster too. But we’re probably talking about different things. In this case, by “cluster” I mean the servers, networking, operating systems, and other infrastructure-related pieces that make up the platform Hadoop runs on.
While Ambari and other management solutions focus on Hadoop-related software, few address the configuration, installation, and management of the hardware and software Hadoop needs to have in place before you can even install it.
Let’s face it, most discussions about Hadoop deployment start in the middle of the story. They assume you’ve already got a working cluster to install Hadoop on, but a lot has to happen before you get there. And what happens after you’re up and running? How will you keep tabs on what’s going on down in the engine room? If you want to keep those Hadoop jobs humming along, you’ll need to do a good job of monitoring and managing that cluster.
So how do you get from bare hardware to the point where you have a functioning cluster of servers — usually running Linux — that is fully networked and loaded with all the right software?
You can build the systems by hand, installing the right combination of operating system and application software, and configuring the networking hardware and software properly. You might even use sophisticated scripting tools to help speed up that process. But there is a better way. Cluster managers were born to fill that gap.
It’s true you can build and manage your Hadoop systems without a cluster manager. But there are a number of advantages to using one. First, cluster managers can save you a lot of time up front, when you first rack the servers and load them with software. In fact, they are designed to automate the entire process once the servers are racked and cabled. They will do hardware discovery, read the intended configuration of each machine from a database, and load the software onto them all, whether you have 4 nodes or 4,000. And they will do it repeatably, which comes in handy if you are deploying a series of clusters to branches of your organization around the world. Automating the process reduces the number of configuration errors and missteps that can occur whenever a complex, multistep procedure is involved. A good cluster manager will let you go from bare metal to a working cluster in less than an hour. Think of the savings!
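To make that discover, look up, and load sequence a little more concrete, here is a minimal Python sketch. It is not how any particular cluster manager works. The `CONFIG_DB` dictionary, the `discover_hardware` function, the MAC addresses, and the package names are all hypothetical stand-ins, and the "install" step only builds a plan rather than touching real machines.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    hostname: str
    role: str
    packages: list


# Hypothetical stand-in for the cluster manager's configuration database,
# keyed by the MAC address of each server's provisioning interface.
CONFIG_DB = {
    "aa:bb:cc:00:00:01": NodeConfig("master01", "master", ["linux-base", "jdk"]),
    "aa:bb:cc:00:00:02": NodeConfig("worker01", "worker", ["linux-base", "jdk"]),
}


def discover_hardware():
    """Pretend discovery step: MAC addresses seen on the provisioning network."""
    return ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:99"]


def provision(discovered, config_db):
    """Match discovered hardware to its intended configuration.

    Returns an install plan plus a list of machines the database does not know.
    """
    plan, unknown = [], []
    for mac in discovered:
        config = config_db.get(mac)
        if config is None:
            unknown.append(mac)  # racked and cabled, but never described
            continue
        plan.append((config.hostname, config.role, tuple(config.packages)))
    return plan, unknown


plan, unknown = provision(discover_hardware(), CONFIG_DB)
for hostname, role, packages in plan:
    print(f"{hostname}: role={role}, install {', '.join(packages)}")
print("unrecognized hardware:", unknown)
```

Because the plan comes entirely from the database, running the same procedure against a second set of servers gives the same result, which is the point of automating it.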
And the usefulness of cluster managers doesn’t end once the Hadoop cluster is deployed. While the Hadoop management software available in the leading distributions from Cloudera, Hortonworks, MapR and others does a great job of monitoring and managing Hadoop, it knows very little about the underlying cluster the Hadoop systems run on. Cluster managers provide the missing pieces of monitoring and managing the infrastructure so that you can keep things running properly. They monitor things like CPU temperatures and disk performance, and can alert operators to problems before they snowball out of control.
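As a rough illustration of that kind of infrastructure monitoring, the Python sketch below compares per-node readings against fixed limits and collects alerts. The metric names, the threshold values, and the sample readings are all made up for the example. A real cluster manager would gather these from hardware sensors and the operating system, and would likely use smarter rules than a simple limit.

```python
# Hypothetical alert limits; real values depend on your hardware.
THRESHOLDS = {"cpu_temp_c": 85.0, "disk_latency_ms": 50.0}


def check_node(hostname, metrics, thresholds=THRESHOLDS):
    """Return one alert string for each metric that exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{hostname}: {name}={value} exceeds {limit}")
    return alerts


# Made-up readings for two nodes.
samples = {
    "worker01": {"cpu_temp_c": 62.0, "disk_latency_ms": 12.0},
    "worker02": {"cpu_temp_c": 91.5, "disk_latency_ms": 75.0},
}

alerts = [a for host, m in samples.items() for a in check_node(host, m)]
for alert in alerts:
    print(alert)
```

Here only worker02 trips the limits, so an operator would hear about the hot CPU and the slow disk on that one node before Hadoop jobs start failing on it.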
So the next time you’re discussing Hadoop solutions for your organization, don’t forget to consider the foundation those solutions rest on. Once you put your cluster into production, you’ll want to be able to manage the entire stack. Consider the operational procedures that need to happen BEFORE you install Hadoop, and the systems you’ll need to have in place to keep your cluster healthy AFTER Hadoop is up and running. I promise you will be glad you took the time to “fill the gaps” down the road.

Photo credit: ucumari via photopin cc