Hadoop 101: Do You Need Hadoop To Manage Your Data?


By Lionel Gibbons | February 10, 2017 | Big Data, Hadoop




Many businesses today are finding that the amount of data they generate and store is rising exponentially. If that's the case with your organization, you're probably searching for ways to efficiently and cost-effectively manage all that information. One popular option for doing just that is a system called Hadoop.

Hadoop is widely used by both large and small companies because of its ability to handle huge amounts of data. But is it the right solution for your business? Here are some things to consider in making that choice.

What is Hadoop?

Hadoop is a software framework that is optimized for the distributed processing of very large datasets. Its two main features are the Hadoop Distributed File System (HDFS), which handles storing files, and MapReduce, which processes the stored information.

HDFS can easily store terabytes of data using any number of inexpensive commodity servers. It does so by breaking each large file into blocks (the default block size was 64MB in early Hadoop releases; Hadoop 2.x and later default to 128MB). Copies of each block are stored on several (usually three) different servers to guard against data loss from hardware failure. It's the job of HDFS to store, manage, and provide access to all the data blocks that make up a file.
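To make the block-and-replica idea concrete, here is a minimal sketch in plain Python of how a file might be divided into blocks and each block placed on three servers. This is purely illustrative arithmetic, not actual HDFS code; the round-robin placement and node names are assumptions for the example.

```python
# Illustrative sketch of how HDFS splits a file into fixed-size blocks
# and replicates each block across servers (not the real HDFS logic).

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the common default block size
REPLICATION = 3                  # copies kept of each block

def plan_blocks(file_size, servers):
    """Return a list of (block_index, [server, ...]) placements."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    plan = []
    for i in range(num_blocks):
        # Spread the three replicas across distinct servers, round-robin.
        replicas = [servers[(i + r) % len(servers)] for r in range(REPLICATION)]
        plan.append((i, replicas))
    return plan

# A 1 GB file on a five-node cluster splits into 8 blocks of 128 MB:
plan = plan_blocks(1024 * 1024 * 1024, ["node1", "node2", "node3", "node4", "node5"])
print(len(plan))    # 8
print(plan[0])      # (0, ['node1', 'node2', 'node3'])
```

The key point the sketch captures is that no single server ever needs to hold the whole file, and losing any one server still leaves two copies of every block.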

MapReduce allows your applications to process data blocks in parallel rather than serially. It does so by running segments of your application on the same server where the particular data blocks it needs reside. Suppose, for example, that your application needs to find the total amount of payments a customer has made over a ten-year period. That could mean sorting through a huge number of records to pick out the required data. Doing this in serial fashion would take a lot of time. But MapReduce would assign each server on which a block of your file resides to process just that block, while other servers, perhaps dozens of them, do the same calculations on other blocks at the same time. MapReduce then consolidates all those results. This approach allows your application to get the answer it needs much faster.
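The payment-total example above can be simulated in a few lines of plain Python. This is a toy model of the MapReduce pattern, not the Hadoop API: each "block" of payment records is mapped independently (as separate servers would do in parallel), and the emitted pairs are then reduced into per-customer totals. The customer names and amounts are invented for illustration.

```python
# Toy simulation of the MapReduce pattern: map each block of records
# independently, then reduce the emitted (key, value) pairs.

from collections import defaultdict

def map_block(block):
    """Map phase: emit (customer, amount) pairs from one data block."""
    for customer, amount in block:
        yield customer, amount

def reduce_pairs(all_pairs):
    """Reduce phase: sum the amounts emitted for each customer."""
    totals = defaultdict(float)
    for customer, amount in all_pairs:
        totals[customer] += amount
    return dict(totals)

# Two blocks, which a real cluster would process on different servers:
blocks = [
    [("alice", 100.0), ("bob", 50.0)],
    [("alice", 25.0), ("bob", 75.0)],
]
pairs = [pair for block in blocks for pair in map_block(block)]
print(reduce_pairs(pairs))  # {'alice': 125.0, 'bob': 125.0}
```

Because `map_block` touches only its own block, a cluster can run one mapper per block simultaneously; the reduce step is the only place the partial results meet.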

Is Hadoop a good match for your data and your budget?

The primary function of Hadoop is to run analytics quickly over huge sets of unstructured data. In other words, Hadoop is all about handling "big data." So the first question to ask is whether that's the kind of data you are working with. Second, how quickly do you need results? Classic MapReduce is batch-oriented: where Hadoop excels is in churning through large datasets with high throughput, rather than answering individual queries with real-time latency.

Another consideration is the rate at which your data storage requirements are growing. A big advantage of Hadoop is that it is extremely scalable. You can add new storage capacity simply by adding server nodes to your Hadoop cluster. In theory, a Hadoop cluster can be expanded almost indefinitely as needed using low-cost commodity server and storage hardware.
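A quick back-of-the-envelope sketch shows why adding nodes translates directly into capacity. The figures here are assumptions for illustration (3x replication, 10 TB of raw disk per node), not properties of any particular cluster:

```python
# Illustrative capacity math for growing a Hadoop cluster.
# Assumed figures: 3x replication, 10 TB of raw disk per node.

REPLICATION = 3
RAW_TB_PER_NODE = 10

def usable_capacity_tb(num_nodes):
    """Usable HDFS capacity in TB after replication overhead."""
    return num_nodes * RAW_TB_PER_NODE / REPLICATION

print(usable_capacity_tb(12))   # 40.0
print(usable_capacity_tb(24))   # 80.0 -- doubling the nodes doubles capacity
```

Note that replication means usable capacity is roughly a third of raw disk, which is part of why inexpensive commodity hardware matters to the economics.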

If your business faces the combination of huge amounts of data, along with a much less than huge storage budget, Hadoop may well be the best solution for you.

This is just a brief overview of how Hadoop may be able to help your company achieve its data storage and processing objectives. If you'd like to know more, please contact us.

