Six Talks We Would Recommend from the Hadoop Summit in Dublin


By Bright Staff | May 05, 2016 | Big Data, Hadoop



The Big Data team at Bright Computing sent a group of inquisitive developers to Hadoop Summit 2016, which took place in Dublin last month.

When they got back to the office to share their experiences with the rest of the team, they put together a list of their top six presentations from the event.

  1. Emer Coleman presented on ethics in technology, noting the wonderful ways that data can be joined together from multiple sources, while also warning that this same capability makes it easier for governments and unscrupulous companies to manipulate their users. 

    Programmers have come to understand that security is not an "add-on" or a "plug-in" to be bolted on later; security is something that has to be kept in mind at all times. From Emer's talk, we can expect privacy to be a similar concern; programmers must not wait for specialists to tell them what to do.

  2. Raj Mukherjee presented on using a Data Lake at the core of a life assurance business, featuring the architecture of the Data Lakes used by Zurich Life Insurance. The talk was a useful introduction to terms like EDW (Enterprise Data Warehouse) and how they differ from Data Lakes: a Data Lake keeps all data, while an EDW keeps only the data selected for its reporting purposes; a Data Lake must support all users and data formats, whereas an EDW is case-specific; and, perhaps most importantly, a Data Lake must be adaptive, whereas an EDW is static. 

  3. Prasad Chalasani talked about fast distributed online classification and clustering. The aim of his talk was to present the technique he used to answer the question: what is the probability that a user who sees an ad will buy the product? The talk introduced an online learning model, which has the advantage of being able to make predictions without scanning all of the input, making it a much faster solution. It can get stuck in local optima, but its advantages are expected to outweigh that drawback. He then introduced Slider, the machine learning library MediaMath developed for Spark, and compared it with two other established Spark machine learning libraries: while those two libraries only overlap in functionality, Slider covers all of their use cases. Slider will soon be released as open source.
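    For intuition, here is a toy online learner in the spirit described above: a plain logistic regression trained one example at a time by stochastic gradient descent, so it never needs to rescan past input. The features, labels, and learning rate are invented for illustration and are unrelated to Slider.

    ```python
    import math
    import random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    class OnlineLogisticRegression:
        """Toy online learner: one gradient step per example, no rescans."""

        def __init__(self, n_features, lr=0.1):
            self.w = [0.0] * n_features
            self.b = 0.0
            self.lr = lr

        def predict_proba(self, x):
            z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
            return sigmoid(z)

        def update(self, x, y):
            # single stochastic-gradient step on the log-loss
            err = self.predict_proba(x) - y
            for i, xi in enumerate(x):
                self.w[i] -= self.lr * err * xi
            self.b -= self.lr * err

    random.seed(0)
    model = OnlineLogisticRegression(n_features=2)
    for _ in range(2000):
        x = [random.random(), random.random()]
        y = 1 if x[0] > 0.5 else 0  # synthetic "bought after seeing ad" label
        model.update(x, y)

    # predictions are available at any point in the stream
    p_high = model.predict_proba([0.9, 0.5])  # should be high
    p_low = model.predict_proba([0.1, 0.5])   # should be low
    ```

    Because each example is seen once and discarded, memory use stays constant regardless of stream length, which is what makes this style of model attractive for ad-scale data.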

  4. Dhabaleswar K. (DK) Panda presented on accelerating Apache Hadoop through high-performance networking and I/O technologies. Remote Direct Memory Access (RDMA) is memory access from one computer to another in which neither operating system gets involved. InfiniBand (IB) and RoCE (RDMA over Converged Ethernet) are protocols that support RDMA, and they can act as a TCP/IP replacement. The talk introduced the presenter's team, which is working on bringing together the worlds of HPC and Big Data, and their Hadoop and Spark derivatives featuring an RDMA-based data shuffle (the phase when data is transported between nodes, required, for example, to execute a join between two tables stored in different locations). Such a Spark deployment can be configured as usual, but it provides some additional configuration options.
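    To picture what the shuffle phase does, here is a toy sketch in ordinary Python (not the presenters' RDMA implementation): rows from two tables are hash-partitioned on the join key, so matching keys land in the same partition, where the join can then run locally.

    ```python
    from collections import defaultdict

    def shuffle_by_key(rows, key_index, n_partitions):
        """Hash-partition rows on the join key, mimicking the shuffle
        that sends co-joinable rows to the same node."""
        partitions = defaultdict(list)
        for row in rows:
            partitions[hash(row[key_index]) % n_partitions].append(row)
        return partitions

    # two small "tables", imagined to live on different nodes, joined on user_id
    orders = [(1, "book"), (2, "pen"), (3, "lamp")]
    users = [(1, "alice"), (2, "bob"), (3, "carol")]

    left = shuffle_by_key(orders, 0, n_partitions=2)
    right = shuffle_by_key(users, 0, n_partitions=2)

    # after the shuffle, each partition joins locally, with no further
    # cross-node traffic needed
    joined = []
    for p in range(2):
        names = {user_id: name for user_id, name in right.get(p, [])}
        for user_id, item in left.get(p, []):
            if user_id in names:
                joined.append((user_id, names[user_id], item))
    ```

    The expensive part in a real cluster is the network transfer implied by `shuffle_by_key`, which is exactly the step the RDMA-based shuffle accelerates by bypassing the TCP/IP stack.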

  5. Julian Hyde’s talk was about planning with polyalgebra: bringing together relational, complex and machine learning algebra. In the presentation we were introduced to Apache Calcite, a framework that provides a SQL interface over non-SQL data sources. Its CSV adapter, for example, makes a directory of CSV files appear to be a schema containing tables, then provides a SQL interface to those files. Other adapters can be plugged in, making their data sources readily available through the same SQL interface.
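    For illustration, a model file along the lines of the Calcite CSV tutorial (the schema name and directory here are made up) is all it takes to expose a directory of CSV files as a queryable schema:

    ```json
    {
      "version": "1.0",
      "defaultSchema": "SALES",
      "schemas": [
        {
          "name": "SALES",
          "type": "custom",
          "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
          "operand": {
            "directory": "sales"
          }
        }
      ]
    }
    ```

    Each CSV file in the directory then shows up as a table that can be queried with ordinary SQL, without any data being loaded or converted up front.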

  6. Finally, Casey Stella talked about using natural language processing on non-textual data with MLlib. Casey is a data scientist and showed the results of his research analyzing medical data in the US. He used word2vec to transform medical n-grams into algebraic vectors. Then, using the distance between those vectors, he could find relationships among symptoms and diseases that are not yet described in the literature. Those relationships are fed back to researchers for further investigation.
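    The nearest-neighbor step is easy to sketch: once n-grams are embedded as vectors, the related terms are simply those with the highest cosine similarity. A toy example with invented three-dimensional vectors (not Casey's data, and not MLlib's API):

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # toy embeddings standing in for word2vec output on medical n-grams;
    # the values are invented purely for illustration
    vectors = {
        "chest pain": [0.9, 0.1, 0.2],
        "shortness of breath": [0.8, 0.2, 0.3],
        "rash": [0.1, 0.9, 0.1],
    }

    query = "chest pain"
    neighbors = sorted(
        (term for term in vectors if term != query),
        key=lambda term: cosine(vectors[query], vectors[term]),
        reverse=True,
    )
    nearest = neighbors[0]  # the n-gram closest to the query in embedding space
    ```

    At scale, the interesting finds are the neighbors that co-occur in the embedding space but are not yet linked in the medical literature.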

Overall, the team agreed that this was a very worthwhile two days, and the Bright developers came home brimming with new ideas to feed into our big data development roadmap.