Seismic migration using Hadoop: How did I get here?


By Ian Lumb | January 22, 2015 | Hadoop, Big Data Analytics, Oil & Gas, Seismic Processing, Seismic Migration, Seismology, Petroleum Exploration



Aspiring seismologist

I always wanted to be a seismologist.

Scratch that: I always wanted to be an astronaut. How could I help it? I grew up in suburban London (UK, not Ontario) watching James Burke cover the Apollo missions. (Guess I’m also revealing my age here!)

Although I never gave my childhood dream of becoming an astronaut more than a fleeting consideration, I did pursue a career in science.

As my high-school education drew to a close, I had my choices narrowed down to being an astronomer, a geophysicist or a nuclear physicist. In grade 12 at Laurier Collegiate in Scarboro (Ontario, not UK … or elsewhere), I took an optional physics course that introduced me to astronomy and nuclear physics. And although I was taken by both subjects, and influenced by wonderful teachers, I dismissed both as areas of focus in university. As I recall, I had concerns that I wouldn’t be employable if I had a degree in astronomy, and I wasn’t ready to confront the ethical/moral/etc. dilemmas I expected would accompany a choice of nuclear physics. Go figure!

And so it was to geophysics I was drawn, again influenced significantly by courses in physical geography taught by a wonderful teacher at this same high school. My desire to be a seismologist persisted throughout my undergraduate degree at Montreal’s McGill University where I ultimately graduated with a B.Sc. in solid Earth geophysics. Armed with my McGill degree, I was in a position to make seismology a point of focus.

But I didn’t. Instead, at Toronto’s York University, I applied Geophysical Fluid Dynamics (GFD) to Earth’s deep interior -- mostly Earth’s fluid outer core. Nothing superficial here (literally), as the core only begins some 3,000 km below where we stand on the surface!

Full disclosure: In graduate school, the emphasis was GFD. However, seismology crept in from time to time. For example, I made use of results from deep-Earth seismology in estimating the viscosity of Earth’s fluid outer core. Since this is such a deeply remote region of our planet, geophysicists need to content themselves with observations accessible via seismic and other methods.

All this to say that I have been a seismologist-from-a-distance -- quite a distance, most of the time!

BUT … I always wanted to be a seismologist.

De facto seismologist


Finally, the stage is set for me to assume the position -- of being a seismologist, that is. How so? Well, in my day job at Bright Computing, I’ve been exposed to Big Data Analytics. For many, this equates to making use of Hadoop to analyze data. Hadoop is proving to be a fine example of a disruptive technology. That being the case, I thought I’d attempt to apply Hadoop in processing seismic data. Specifically, the idea is to recontextualize seismic migration using Hadoop. Even more specifically, Reverse-Time Migration (RTM) using Hadoop.

RTM (or at least migration of some type) is interesting as it’s an essential step in processing seismic-reflection data. Whether the intention is to use seismic-reflection data for petroleum exploration, subsurface engineering, deep-crustal studies or some other purpose, migration removes the noise created by complex geological structures like folds, faults and domes.

In my travels in the Houston oil patch, working for Platform Computing, Scali, Allinea Software and now Bright Computing, I’ve learned that there are numerous existing implementations able to perform RTM on seismic-reflection data. Of key interest to me is how parallelism fits into the implementation. Is embarrassing parallelism exploited to process chunks of data digestible by the compute infrastructure? Is parallelism exploited in the algorithms that implement RTM at the code level? Or is it some combination of the two? Are GPGPUs used, or just CPUs?

Hadoop has the potential to be interesting in this context of RTM of seismic-reflection data. At the outset, Hadoop is all about a distributed file system known as HDFS that is effectively limitless in scale. This is table stakes in terms of processing seismic data since huge data volumes have been the norm since long before Hadoop and the term Big Data were even conceived.

If Hadoop were only about a distributed, high-capacity file system, that would not be enough to warrant serious consideration in an RTM context, as there are numerous alternatives -- alternatives that have a well-established track record for applicability in the oil patch. What makes it interesting at this most fundamental level is that Hadoop is also about teaming compute and data locally -- in other words, workloads are mindfully executed with innate awareness of topology. Even in the case of simple Map-Reduce workloads, the locality of data and compute resources is a core competence. Subsequent improvements (e.g., YARN) enable additional capabilities.
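To make the Map-Reduce idea concrete, here is a minimal sketch in plain Python (no Hadoop involved) of how per-shot work might be mapped and then stacked into a single image. Everything in it is an assumption for illustration: the names `map_shot`, `reduce_cell` and `run_job` are hypothetical, and the map step is a placeholder that simply assigns each sample to the image cell with the same index. A real RTM kernel would propagate wavefields instead.

```python
from collections import defaultdict


def map_shot(shot_id, traces):
    """Map step: emit (image_cell, contribution) pairs for one shot.

    Placeholder for a per-shot migration (an assumption for illustration):
    each trace sample contributes to the image cell with the same index.
    """
    for trace in traces:
        for cell, amplitude in enumerate(trace):
            yield cell, amplitude


def reduce_cell(contributions):
    """Reduce step: stack (sum) every contribution to one image cell."""
    return sum(contributions)


def run_job(shots):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for shot_id, traces in shots.items():
        # Shots are independent of one another, so this loop is the part
        # a framework could spread across many nodes.
        for cell, value in map_shot(shot_id, traces):
            grouped[cell].append(value)
    return {cell: reduce_cell(values) for cell, values in sorted(grouped.items())}


# Tiny made-up data set: two shots, three samples per trace.
shots = {
    "shot-1": [[1.0, 2.0, 0.0], [0.5, 0.0, 1.0]],
    "shot-2": [[0.0, 1.0, 1.0]],
}
image = run_job(shots)
```

The point of the sketch is the shape of the computation: independent per-shot work in the map step, followed by a sum in the reduce step.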

Counterintuitive to existing practices in HPC, and by association, the oil patch, compute-data locality is achieved in Hadoop by treating storage as a resource to be used freely. In fact, Hadoop takes advantage of affordable storage resources to boost reliability by replicating data across multiple physical disks distributed across multiple physical systems across the entire infrastructure. Thus Hadoop delivers a distributed, high-capacity, reliable parallel file system optimized for compute-data locality. Now we have, in principle, something that might be of more interest to those needing to process reams and reams of seismic-reflection data via RTM.
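The interplay between replication and locality can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hadoop's actual logic: `place_replicas` and `schedule` are hypothetical names, placement is plain round-robin (real HDFS placement is rack-aware), and the scheduler simply prefers any node that already holds a copy of the block. The replication factor of 3 mirrors the HDFS default.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement policy (an assumption); real HDFS is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def schedule(placement, free_slots):
    """Assign one task per block, preferring a node that holds a replica.

    Returns {block: (node, is_local)}. Falls back to any node with a free
    slot when no replica holder is available.
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    assignment = {}
    for block, holders in placement.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free slots left")
        node = max(candidates, key=lambda n: slots[n])  # least-loaded candidate
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


nodes = ["n1", "n2", "n3", "n4"]
blocks = ["b0", "b1", "b2", "b3"]
placement = place_replicas(blocks, nodes)
assignment = schedule(placement, {n: 1 for n in nodes})

# A block whose replica holders are all busy runs remotely instead.
remote = schedule({"b0": placement["b0"]}, {"n4": 1})
```

With three copies of every block, the scheduler in this toy setup finds a local node for each task; with a single copy, most tasks would have had to read their data over the network. That is the sense in which spending storage freely buys compute-data locality.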

This is just the beginning … and the context for an initiative I’m currently involved in. I’ll be sharing my findings here as I ramp up to a presentation at the Rice Oil & Gas Workshop in March. The abstract I wrote for the Rice workshop is as follows:

RTM using Hadoop: Is There a Case for Migration?

Reverse-Time Migration (RTM) is a compute-intensive step in processing seismic data for the purpose of petroleum exploration. Because complex geologies (e.g., folds, faults, domes) introduce unwanted signal (a.k.a. noise) into recorded seismic traces, RTM is also an essential step in all upstream-processing workflows. The need to apply numerically intensive algorithms for wave-equation migration against extremely large volumes of seismic data is a well-established industry requirement. Not surprisingly then, providers of processing services for seismic data continue to make algorithm development an ongoing area of emphasis. With implementations making use of the Message Passing Interface (MPI), and variously CUDA for programming GPUs, RTM algorithms routinely exploit the processing of large volumes of seismic data in parallel. Given its innate ability to topologically align data with compute, through the combination of a parallel, distributed, high-volume filesystem (HDFS or Lustre) and workload manager (YARN), RTM algorithms could make use of Hadoop. Given the current level of convergence between High Performance Computing (HPC) and Big Data Analytics, the barrier to entry has never been lower. At the outset then, this presentation reviews the opportunities and challenges for Hadoop’izing RTM. Because recontextualizing RTM for Big Data Analytics will be a significant undertaking for organizations of any size, the analytics upside of using Hadoop applications as well as Apache Spark will also be considered. Although the notion of Hadoop’izing RTM is at the earliest of stages, the platform provided by Big Data Analytics has already delivered impressive results in processing large-scale seismic event data via waveform cross correlation (e.g., Addair et al., 2014,

If you have any thoughts on RTM using Hadoop, please feel free to comment here, or connect with me using any other channel.