Spark'ing Scope Creep
I had the fortunate opportunity to present in the disruptive-technology track at the 2015 Rice Oil and Gas HPC Workshop during the first week of March. What I presented during my two-minute drill in this session ended up being much more disruptive than I anticipated.
You see, when I submitted my abstract for the event in mid-January, my intention was to focus on the various possibilities for parallelism enabled through Hadoop. As I wrote in a blog post in January:
“What makes it interesting at this most-fundamental level, is that Hadoop is also about teaming compute and data locally - in other words, workloads are mindfully executed with innate awareness of topology.”
The scope of my investigation was originally restricted to what I hoped might be some novel exploitation of HDFS and YARN in support of seismic processing, specifically Reverse-Time Migration (RTM). In the month-and-a-half ramp-up to the Rice workshop, I confirmed that there were numerous possibilities for enhancing RTM, and that this was still of relevance to the industry as a whole. What I did not anticipate, however, was just how disruptive the introduction of Apache Spark would prove to be.
Spark’s Disruptive Potential Beyond Energy Exploration
My poster summarizes eight key points that make Spark disruptive for energy exploration and other industries:
Spark can likely make use of your existing file systems. Although Spark can make use of HDFS, it can be completely decoupled from it. IBM’s GPFS is an existing alternative, and Intel is working on a Lustre-based offering. There are existing cloud-savvy alternatives as well - more on this below. If your investment is in a different file system, chances are it can be integrated for use with Spark. Reason 2 of my top 8 on why Spark is so hot provides additional context for this gating consideration.
Spark will integrate with your HPC workload manager. In this case the question is not if, but when. I can assure you that all providers of traditional workload managers (WLMs) for HPC have integration with Spark under consideration. YARN may provide the gateway, or WLM providers may decide to build their own ‘adapters’. In the interim, your existing HPC WLM will need to allocate a static pool of resources for Spark to make use of. If something more dynamic is essential, Docker-contained Spark instances might suit your initial needs. Reason 3 of my top 8 will provide you with additional context and possibilities.
Spark can be deployed alongside your HPC cluster. As is the case with file systems and WLMs, organizations with production workflows serving paying customers don’t have the cycles to spin up another separate-and-distinct environment for technical computing. Fortunately, with Bright Cluster Manager, they don’t need to. Bright deploys Spark onto bare metal in under 60 minutes. Our short-term roadmap includes a plan to monitor and manage Spark, something we already do for HPC and Hadoop. Ours is a unique capability in this regard.
You can likely use your existing code with Spark. Adopting Spark is not an all-or-nothing proposition. For example, existing applications written in C or C++ can be invoked using Java Native Access (JNA). Spark itself is Scala-native, and it also provides bindings for Java and Python. Are you one of those organizations already deriving value from GPU-enabled applications? Not a problem, as Spark has already been integrated with these accelerators. (As for Fortran, and in terms of full disclosure, further research is required.)
Spark will introduce an analytics upside. In my seismic-processing example, the converged analytics platform provided by Spark (reason 5 of my top 8) provides alternatives for establishing coherence between time series. I anticipate that Spark’s analytics upside will be much more about introducing new capabilities than about computational performance - e.g., using machine learning (Spark’s MLlib) to reduce the uncertainty inherent in inverse problems.
You could Spark up a cloud. Databricks, the company founded by Spark’s original creators, has developed a PaaS offering. Organizations seeking greater autonomy over their IT infrastructure, however, may prefer to stand up platforms for Spark in public (e.g., Amazon Web Services) or private (e.g., OpenStack) clouds - see, for example, Prof. Huang’s seismic data analytics cloud at Prairie View A&M University. Noteworthy in these cloud contexts is that Spark already supports Amazon S3 as well as OpenStack Swift for storage. Bright Cluster Manager provisions, monitors and manages clusters within AWS. In addition, Bright can deploy, monitor and manage OpenStack. And into both classes of cloud, Bright can deploy Spark, as well as the rest of the software stack you require - including your in-house applications and toolchains.
Spark is not a transient phenomenon. The momentum behind Spark is substantial. Open-source Spark has a captivated and fully engaged audience, is being adopted by name-brand corporations, and has a track record of delivering results lightning fast. For a deeper discussion, review reasons 7 and 8 of my top 8 take on why Spark is so hot.
Spark continues to improve. In my seismic-processing example, there is already ample justification for a closer look at Spark. However, it’s important to remember that Spark is a work in progress. Spark release 1.3.0, which arrived in the middle of March, introduced a DataFrame API. According to a Databricks blog post from mid-February: “In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.” Although the utility of Spark’s support for DataFrames in seismic processing still needs to be worked out, there is optimism that this extension to the Spark API will add value.
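To make the Databricks description a little more concrete, here is a minimal PySpark sketch of a DataFrame with named columns. Everything specific in it is an assumption made for illustration: the column names (shot_id, trace_id, peak_amplitude), the sample values, and the helper names are hypothetical, chosen only to echo the seismic example. The function expects an already-created SQLContext (Spark 1.3) or SparkSession (later releases) to be passed in, and it has not been tried against real seismic data.

```python
# Hypothetical column names and sample values, for illustration only.
COLUMNS = ["shot_id", "trace_id", "peak_amplitude"]


def sample_rows():
    """Return a few made-up trace summaries as plain tuples."""
    return [
        (1, 101, 0.82),
        (1, 102, 0.47),
        (2, 201, 0.91),
        (2, 202, 0.15),
    ]


def strongest_trace_per_shot(sql_context, rows, threshold=0.2):
    """Build a DataFrame from `rows` and query it by column name.

    `sql_context` is assumed to be an existing SQLContext (Spark 1.3)
    or SparkSession (Spark 2.x and later); both offer createDataFrame.
    """
    df = sql_context.createDataFrame(rows, COLUMNS)
    return (
        df.filter(df.peak_amplitude > threshold)  # drop weak traces
          .groupBy("shot_id")                     # one group per shot
          .agg({"peak_amplitude": "max"})         # strongest trace in each
    )
```

With a context in hand, calling show() or collect() on the returned DataFrame should yield one row per shot. The point of the sketch is simply that the query is expressed against named columns rather than positional tuple fields, which is what gives Spark room to optimize it.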
Although Apache Spark presents a number of transformative possibilities for seismic processing, it is not my intention to represent Spark as being applicable to each and every use case currently serviced purely by HPC. In addition to seismic processing, Spark can and should be, or is being, exploited for:
- Monte Carlo simulations in financial risk analysis and high-energy physics
- Sequencing genomes in bioinformatics (e.g., SparkSeq)
- Real-time medical imaging
- Rendering images in the creation of digital content
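As an illustration of the first item in this list, Monte Carlo estimation maps naturally onto Spark because every batch of samples is independent. The sketch below estimates pi, which is only a stand-in for a real risk or physics model. The function names, batch sizes and seeding scheme are my own illustrative assumptions; run_local exercises the logic in plain Python, while run_on_spark assumes an existing SparkContext is passed in as sc.

```python
import random


def hits_in_batch(batch):
    """Count random points that land inside the unit quarter-circle.

    `batch` is a (seed, n_samples) pair; seeding each batch separately
    keeps the batches independent and the result reproducible.
    """
    seed, n_samples = batch
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def make_batches(n_batches, samples_per_batch):
    """Create one (seed, n_samples) work item per batch."""
    return [(seed, samples_per_batch) for seed in range(n_batches)]


def estimate_pi(total_hits, total_samples):
    """Quarter-circle area is pi/4, so scale the hit ratio by 4."""
    return 4.0 * total_hits / total_samples


def run_local(n_batches=8, samples_per_batch=50000):
    """Plain-Python version: map over the batches, then sum the hits."""
    batches = make_batches(n_batches, samples_per_batch)
    total_hits = sum(map(hits_in_batch, batches))
    return estimate_pi(total_hits, n_batches * samples_per_batch)


def run_on_spark(sc, n_batches=8, samples_per_batch=50000):
    """Same computation on Spark, with one partition per batch.

    `sc` is assumed to be an existing SparkContext.
    """
    batches = make_batches(n_batches, samples_per_batch)
    total_hits = (
        sc.parallelize(batches, n_batches)
          .map(hits_in_batch)
          .reduce(lambda a, b: a + b)
    )
    return estimate_pi(total_hits, n_batches * samples_per_batch)
```

The only difference between the two entry points is where the map and the reduction run. That is a large part of why embarrassingly parallel workloads such as these are an easy first candidate for Spark.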
As the possibilities for Spark continue to develop, there’s no question that a converged infrastructure for HPC and Big Data analytics is required. With Bright Cluster Manager, organizations can begin investigating next-generation applications on converged clusters and clouds in less than 60 minutes.