By Michele Lamarca | September 14, 2016
Bright Cluster Manager for Big Data Version 7.3 has a lot of great new features. One we are really excited about is support for Apache Cassandra, a top-level Apache distributed database project born at Facebook about a decade ago. Apache Cassandra is being hailed as a great solution for managing large amounts of structured data across many commodity servers while providing a highly available service with no single point of failure.
In some cases, Cassandra can serve as an interesting alternative to Apache Hadoop’s HDFS. If you want to store a lot of data for big data applications, HDFS can definitely satisfy the need. But if high availability (say, 100 percent availability) is your goal, Cassandra may be the better option. It offers continuous availability, along with linearly scalable performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones.
With Hadoop, you have a NameNode and DataNodes (i.e., a master and workers). If the master goes down, everything goes down and the cluster cannot work. Running two NameNodes partially mitigates the problem by giving HDFS high availability. Compare that to Cassandra, which has no masters. All nodes have exactly the same role; if one or more nodes go down, the remaining nodes that hold replicas of the data keep serving requests. So, if you want to store data and be sure it is always available (especially if it is spread around the globe), Cassandra is the better solution. For example, it’s already being used by eBay, Netflix, and other major companies that expose data all around the globe and cannot afford downtime. Both eBay and Netflix have complex environments and use both Hadoop and Cassandra, depending on the specific purpose.
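To make the masterless idea more concrete, here is a minimal Python sketch of a token ring with replication. It is a toy model written for this post, not Cassandra's actual implementation: the `ToyRing` class, the node names, and the use of MD5 for placing keys are all assumptions made for illustration, and it leaves out virtual nodes, tunable consistency levels, and repair. It only shows why losing some nodes does not have to mean losing access to the data.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice only)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Hypothetical masterless ring: every node plays the same role."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns one position on the ring, derived from its name.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.down = set()                      # nodes currently unavailable
        self.storage = {n: {} for n in nodes}  # per-node key/value store

    def replicas(self, key):
        """Return the nodes responsible for a key, walking clockwise."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        """Store the value on every live replica; return how many accepted it."""
        live = [n for n in self.replicas(key) if n not in self.down]
        for node in live:
            self.storage[node][key] = value
        return len(live)

    def read(self, key):
        """Return the value from the first live replica that holds it."""
        for node in self.replicas(key):
            if node not in self.down and key in self.storage[node]:
                return self.storage[node][key]
        raise KeyError(key)


ring = ToyRing(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
ring.write("user:42", "Ada")

# Take two of the three replicas offline; the third can still answer.
first, second, _ = ring.replicas("user:42")
ring.down.update({first, second})
print(ring.read("user:42"))  # Ada
```

There is no special coordinator in this sketch: any node could compute `replicas()` for a key, which is the property that removes the single point of failure. Only when every replica of a key is down does the read fail.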
An interesting facet of the puzzle is that Cassandra often gets used with other components of the Hadoop ecosystem, mostly with Apache Spark. Hadoop and Spark can already talk to each other – and now we have the addition of Cassandra. Data coming from a stream (for example, a stream of tweets) could all pass through Kafka, get analyzed by Spark, and then the user could decide to store the results inside either Hadoop or Cassandra.
In the past, if you wanted to implement both Cassandra and Hadoop, you’d have to rely on manual deployment. No more! Support for Cassandra is now integrated into Bright Cluster Manager’s graphical user interface (GUI) in Version 7.3, along with a wizard that makes it easy to deploy Cassandra. We also support the principal Cassandra maintenance operations. Support is currently available for one data center; defining multiple data centers will be available in the next major release.
Cassandra is a great alternative when fault tolerance and continuous availability are high priorities. Many companies and research institutions have already deployed it, and many could benefit from having a system that uses Hadoop and Cassandra at the same time. Bright users can deploy them together within a complex environment.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 711315 (Bright Beyond HPC).