By Michele Lamarca | April 13, 2016 | Apache Spark
As you’ve probably heard by now, Apache Spark™ is a fast, general-purpose engine for large-scale data processing. Bright Cluster Manager has supported Spark since version 7.1, and version 7.2 adds a number of enhancements that improve functionality and ease of use for our users.
The greatest difference between the Spark support in Bright Cluster Manager 7.2 and in 7.1 is the ability to deploy an instance of Spark independently of Hadoop. When we first introduced support for Spark, it was as an additional tool to deploy on top of a Hadoop instance. While that is still useful for many users, for others (for example, users collecting and assessing data from a stream source) Spark is better used independently of Hadoop. Bright users who have no need for HDFS can now benefit from the power and flexibility of Spark, whether they are collecting data from a stream that’s continually in motion or querying an external database (e.g. MySQL) without requiring additional storage.
Improved Installation Wizard:
In addition to supporting the deployment of Spark independently of Hadoop clusters, Bright version 7.2 also makes it much easier to deploy Spark in general. Version 7.2 replaces the command-line deployment script with a visual Spark deployment wizard within the Bright GUI. The wizard auto-fills nearly all of the required fields with predefined values, so users only need to select where to deploy Spark and check the settings before deployment starts.
All told, deploying Spark with the new graphical wizard can take less than 3 minutes, including extensive testing to ensure that the deployment is running smoothly. Spark is a relatively complex open-source tool, and the wizard removes much of that complexity from installation. Users’ choices can be saved to an XML file, which can then be reused as a blueprint for additional Spark deployments.
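To illustrate the idea of reusing saved choices as a blueprint, here is a minimal sketch in Python. It does not reproduce the XML format that Bright actually writes: the element names (`sparkDeployment`, `setting`), the setting keys, and the node names are all hypothetical, chosen only to show how one saved configuration could seed a second deployment with a single value changed.

```python
import xml.etree.ElementTree as ET


def save_blueprint(choices):
    """Serialize deployment choices to an XML string.

    The <sparkDeployment>/<setting> layout is a hypothetical schema,
    not the format produced by the Bright wizard.
    """
    root = ET.Element("sparkDeployment")
    for name, value in choices.items():
        ET.SubElement(root, "setting", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def load_blueprint(xml_text, **overrides):
    """Read choices back from XML, replacing any values given as overrides."""
    root = ET.fromstring(xml_text)
    choices = {s.get("name"): s.text for s in root.findall("setting")}
    choices.update(overrides)
    return choices


# Hypothetical choices made during a first deployment.
first = {"master_node": "node001", "worker_nodes": "node002,node003"}
blueprint = save_blueprint(first)

# Reuse the blueprint for a second deployment on different workers.
second = load_blueprint(blueprint, worker_nodes="node004,node005")
```

Everything except the overridden `worker_nodes` value carries over unchanged from the first deployment, which is the point of keeping the choices in a file.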
Enhanced Support for Additional Tools:
The final major enhancement to Bright’s support for Spark in version 7.2 is added support for related tools like Apache Zeppelin, Alluxio (formerly Tachyon), and Apache Kafka. Zeppelin is a notebook-type web application in which users can perform calculations and intersperse explanations and notes alongside their code. Zeppelin allows users to save these note pages locally and share them with others. Overall, it lowers the barrier to entry for using Spark, since new users don’t need a dedicated environment in order to develop a Spark application.
Alluxio is a memory-centric storage layer that helps speed up computations for Spark, especially in streaming applications where data is constantly being accumulated and analyzed. Similarly, Kafka makes it easier to work with streaming data, acting like a “sink” between the data stream and Spark that accumulates data and passes it on.
Improved support for Kafka is underway for the next Bright Cluster Manager release, as is support for Apache Cassandra, which can work with Spark and Kafka in streaming applications to store the results of computations in a scalable and highly available database. Look for these updates in Bright version 7.3, and in the meantime, test out the enhanced support for Spark in 7.2.
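The role of a buffer sitting between a data stream and its consumer can be sketched with a few lines of plain Python. This is a toy in-memory model, not the Kafka or Spark API: the `BufferedStream` class, its `publish` and `poll` methods, and the sample readings are all invented for illustration. It only shows how events accumulate in an append-only log while a consumer reads them in batches at its own pace.

```python
class BufferedStream:
    """Toy in-memory stand-in for a message buffer (NOT the Kafka API)."""

    def __init__(self):
        self._log = []    # append-only record of every event received
        self._offset = 0  # position of the next event the consumer has not read

    def publish(self, event):
        """Called by the data source: accumulate one event."""
        self._log.append(event)

    def poll(self, max_events):
        """Called by the consumer: return up to max_events unread events."""
        batch = self._log[self._offset:self._offset + max_events]
        self._offset += len(batch)
        return batch


# The source publishes readings as they arrive.
buffer = BufferedStream()
for reading in [3, 5, 8, 13, 21]:
    buffer.publish(reading)

# The consumer processes the backlog in small batches, summing each one.
batch_sums = []
while True:
    batch = buffer.poll(max_events=2)
    if not batch:
        break
    batch_sums.append(sum(batch))
```

Because the consumer tracks its own offset, the source never has to wait for processing to finish, which is the decoupling the paragraph above describes.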