If you are curious to know more about Apache Spark™ and how it can be used for large-scale data processing, check out our latest vodcast series. As a fast and general processing engine compatible with Hadoop data, Spark can run in Hadoop clusters through YARN or in standalone mode. For both batch processing and newer workloads like streaming, interactive queries, and machine learning, Spark has a lot going for it.
In Part I, a Bright expert discusses Bright Cluster Manager for Big Data and walks you through its fully integrated support for Apache Spark. The vodcast highlights how the Spark integration can directly help end users, for instance through the flexibility of running Apache Spark with or without the Hadoop Distributed File System (HDFS).
In Part II, the vodcast digs a bit deeper into the details of Apache Spark, and offers eight reasons why Spark is gaining such a following. For example, Spark handles iterative algorithms and interactive mining tools much more efficiently than MapReduce. Also, Spark provides a converged analytics platform, creating a comprehensive engine for big data analytics. It lets users move rapidly from building simple interactive apps to building sophisticated distributed apps.
Part III discusses how running Apache Spark without Hadoop gives customers a comprehensive platform for big data analytics while letting them choose from a variety of HDFS alternatives. We know that installing a brand-new file system just to support big data can be a real problem, considering how much has already been invested in high-volume, distributed, and scalable parallel file systems. We discuss some of the alternatives available, such as Amazon S3, OpenStack Swift, and IBM’s GPFS, and explain why users might want to make this choice.